Large language models (LLMs) have shown a strong ability to generate code
for productive activities. However, current benchmarks for code synthesis, such
as HumanEval, MBPP, and DS-1000, are predominantly oriented towards
introductory tasks in algorithms and data science, and fall short of the
challenging requirements prevalent in real-world coding. To fill this gap, we
propose NaturalCodeBench (NCB), a challenging code benchmark designed to mirror
the complexity and variety of scenarios in real coding tasks. NCB comprises 402
high-quality problems in Python and Java, meticulously selected from natural
user queries from online coding services, covering 6 different domains. Given
the extraordinary difficulty of creating test cases for real-world queries,
we also introduce a semi-automated pipeline to improve the efficiency of test
case construction. Compared with manual solutions, it achieves an efficiency
increase of more than 4 times. Our systematic experiments on 39 LLMs find that
performance gaps on NCB between models with close HumanEval scores can still
be significant, indicating a lack of focus on practical code synthesis
scenarios or over-specialized optimization for HumanEval. On the other hand, even
the best-performing model, GPT-4, is still far from satisfactory on NCB. The evaluation
toolkit and development set are available at
this https URL
