Recent advancements in large language models (LLMs) have greatly improved code generation, particularly at the function level. For instance, GPT-4 has achieved an 88.4% pass rate on HumanEval. However, this calls into question whether existing benchmarks adequately assess function-level code generation capabilities. We analyzed two common benchmarks, HumanEval and MBPP, and found that they might not thoroughly evaluate LLMs' code generation capacities due to limitations in quality, difficulty, and granularity. To address this, we introduce the Mostly Hard Python Problems (MHPP) dataset, consisting of 140 unique human-curated problems. By focusing on the combination of natural language and code reasoning, MHPP gauges LLMs' abilities to comprehend specifications and restrictions, engage in multi-step reasoning, and apply coding knowledge effectively. Initial evaluations of 22 LLMs using MHPP showed that many models that perform well on HumanEval failed to achieve similar success on MHPP. Moreover, MHPP highlighted previously undiscovered limitations in a range of LLMs, leading us to believe that it could pave the way for a better understanding of LLMs' capabilities and limitations. Dataset and code are available at https://github.com/SparksofAGI/MHPP.

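Large language models (LLMs) have recently made significant progress in code generation, but existing benchmarks may not adequately assess their function-level code generation capabilities. By analyzing two common benchmarks (HumanEval and MBPP), our study found that, due to limitations in quality, difficulty, and granularity, these benchmarks might not thoroughly evaluate LLMs' code generation abilities. We therefore introduce the "Mostly Hard Python Problems" (MHPP) dataset, which contains 140 unique human-curated problems. By combining natural language and code reasoning, MHPP assesses LLMs' abilities to understand specifications and restrictions, perform multi-step reasoning, and apply coding knowledge effectively. Initial evaluations of 22 LLMs using MHPP showed that models that perform well on HumanEval often fail to achieve similar success on MHPP. In addition, MHPP highlighted a range of previously undiscovered limitations of LLMs, leading us to believe that it can pave the way for a better understanding of LLMs' capabilities and limitations. The dataset and code are available at this link.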

MHPP: Exploring the Capabilities and Limitations of Language Models Beyond Basic Code Generation