BriefGPT.xyz
Mar, 2024
Evaluating Large Language Models with Runtime Behavior of Program Execution
Junkai Chen, Zhiyuan Pan, Xing Hu, Zhenhao Li, Ge Li...
TL;DR
This paper proposes a framework named REval for evaluating the code reasoning abilities and consistency of code LLMs by adapting existing code benchmarks. A large-scale empirical study finds that most LLMs perform unsatisfactorily on runtime-behavior reasoning and on incremental consistency evaluation, underscoring the pressing need to improve the code reasoning capabilities of code LLMs.
Abstract
Large language models for code (i.e., code LLMs) have shown strong code understanding and generation capabilities. To evaluate the capabilities of code LLMs in various aspects, many