代码生成评估的基准和指标：一项关键性回顾

Jun, 2024

代码生成评估的基准和指标：一项关键性回顾

Benchmarks and Metrics for Evaluations of Code Generation: A Critical Review

Debalina Ghosh Paul, Hong Zhu, Ian Bayley

TL;DR对大型语言模型在编程任务中的评估工作进行了关键综述，着重讨论了现有工具的评估中使用的基准和度量标准，并提出了进一步研究的方向。

Abstract

With the rapid development of Large Language Models (LLMs), a large number of machine learning models have been developed to assist programming tasks including the generation of program code from natural language