Program synthesis has been long studied with recent approaches focused on
directly using the power of Large Language Models (LLMs) to generate code
according to user intent written in natural language. Code evaluation datasets,
containing curated synthesis problems with input/output test-cases, are used to
measure the performance of various LLMs on code synthesis. However, test-cases
in these datasets can be limited in both quantity and quality for fully
assessing the functional correctness of the generated code. Such limitation in
the existing benchmarks begs the following question: In the era of LLMs, is the
code generated really correct? To answer this, we propose EvalPlus -- a code
synthesis benchmarking framework to rigorously evaluate the functional
correctness of LLM-synthesized code. In short, EvalPlus takes in the base
evaluation dataset and uses an automatic input generation step to produce and
diversify large amounts of new test inputs using both LLM-based and
mutation-based input generators to further validate the synthesized code. We
extend the popular HUMANEVAL benchmark and build HUMANEVAL+ with 81x
additionally generated tests. Our extensive evaluation across 14 popular LLMs
demonstrates that HUMANEVAL+ is able to catch significant amounts of previously
undetected wrong code synthesized by LLMs, reducing the pass@k by 15.1% on
average! Moreover, we even found several incorrect ground-truth implementations
in HUMANEVAL. Our work not only indicates that prior popular code synthesis
evaluation results do not accurately reflect the true performance of LLMs for
code synthesis but also opens up a new direction to improve programming
benchmarks through automated test input generation.

使用 EvalPlus 框架对大型语言模型进行代码综合基准测试，通过自动生成测试输入来扩充现有基准测试集，发现并降低了 LLM 合成代码的错误率，揭示了现有编程基准测试的局限性并为编程基准测试的改进方向开辟了新的方向。