How to evaluate the coding abilities of Large Language Models (LLMs) remains
an open question. We find that existing benchmarks are poorly aligned with
real-world code repositories and are insufficient to evaluate the coding
abilities of LLMs.
To address the knowledge gap, we propose a new benchmark named DevEval, which
has three advances. (1) DevEval aligns with real-world repositories in multiple
dimensions, e.g., code distributions and dependency distributions. (2) DevEval
is annotated by 13 developers and contains comprehensive annotations (e.g.,
requirements, original repositories, reference code, and reference
dependencies). (3) DevEval comprises 1,874 testing samples from 117
repositories, covering 10 popular domains (e.g., Internet, Database). Based on
DevEval, we propose repository-level code generation and evaluate 8 popular
LLMs on DevEval (e.g., gpt-4, gpt-3.5, StarCoder 2, DeepSeek Coder, CodeLLaMa).
Our experiments reveal these LLMs' coding abilities in real-world code
repositories. For example, in our experiments, the highest Pass@1 of
gpt-4-turbo is only 53.04%. We also analyze LLMs' failed cases and summarize
their shortcomings. We hope DevEval can facilitate the development of LLMs in
real code repositories. DevEval, prompts, and LLMs' predictions have been
released.

通过新的基准测试 DevEval，我们评估了 8 种流行的大型语言模型在真实代码库中的编码能力，并发现这些模型的编码能力在真实世界的代码库中存在缺陷。

DevEval：与现实世界源代码仓库对齐的手动注释代码生成基准

DevEval: A Manually-Annotated Code Generation Benchmark Aligned with  Real-World Code Repositories

How to evaluate Large Language Models (LLMs) in code generation is an open
question. Many benchmarks have been proposed but are inconsistent with
practical software projects, e.g., unreal program distributions, insufficient
dependencies, and small-scale project contexts. Thus, the capabilities of LLMs
in practical projects are still unclear. In this paper, we propose a new
benchmark named DevEval, aligned with Developers' experiences in practical
projects. DevEval is collected through a rigorous pipeline, containing 2,690
samples from 119 practical projects and covering 10 domains. Compared to
previous benchmarks, DevEval aligns to practical projects in multiple
dimensions, e.g., real program distributions, sufficient dependencies, and
enough-scale project contexts. We assess five popular LLMs on DevEval (e.g.,
gpt-4, gpt-3.5-turbo, CodeLLaMa, and StarCoder) and reveal their actual
abilities in code generation. For instance, the highest Pass@1 of gpt-3.5-turbo
only is 42 in our experiments. We also discuss the challenges and future
directions of code generation in practical projects. We open-source DevEval and
hope it can facilitate the development of code generation in practical
projects.

通过提出一个与开发者在实践项目中的经验相一致的新基准 DevEval，我们评估了五个热门的大型语言模型在代码生成方面的实际能力，揭示了它们的实际表现，并讨论了在实践项目中代码生成的挑战和未来发展方向。