Recent advancements in large language models (LLMs) have greatly improved
code generation, particularly at the function level. For instance, GPT-4 has
achieved an 88.4% pass rate on HumanEval. However, this calls into question the
adequacy of existing benchmarks for thoroughly assessing function-level code
generation capabilities. Our study analyzed two common benchmarks, HumanEval
and MBPP, and found that they may not thoroughly evaluate LLMs' code
generation capabilities because of limitations in quality, difficulty, and
granularity. To address this, we introduce the Mostly Hard Python Problems
(MHPP) dataset, consisting of 140 unique human-curated problems. By focusing on
the combination of natural language and code reasoning, MHPP gauges LLMs'
abilities to comprehend specifications and restrictions, engage in multi-step
reasoning, and apply coding knowledge effectively. Initial evaluations of 22
LLMs on MHPP showed that many models that perform well on HumanEval failed to
achieve similar success on MHPP. Moreover, MHPP exposed previously
undiscovered limitations in various LLMs, leading us to believe that it
could pave the way for a better understanding of LLMs' capabilities and
limitations. Dataset and code are available at
this https URL
