Recent advancements in large language models (LLMs) have significantly enhanced code generation from natural language prompts. HumanEval, developed by OpenAI, remains the most widely used code generation benchmark. However, this and other Code LLM benchmarks face critical limitations, particularly in task diversity, test coverage, and linguistic scope. Current evaluations focus primarily on English-to-Python conversion tasks with limited test cases, potentially overestimating model performance. While recent work has addressed test coverage and programming language (PL) diversity, code generation from low-resource language prompts remains largely unexplored. To address this gap, we introduce mHumanEval, an extended benchmark supporting prompts in over 200 natural languages (NLs). We compile the benchmark using established machine translation methods, coupled with a quality assurance process. Furthermore, we provide expert human translations for 15 diverse NLs. We conclude by analyzing the multilingual code generation capabilities of state-of-the-art (SOTA) Code LLMs, offering insights into the current landscape of cross-lingual code generation.

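This work addresses the limitations of current code generation benchmarks in task diversity, test coverage, and linguistic scope, in particular the fact that code generation from low-resource language prompts has not been adequately explored. It introduces mHumanEval, an extended benchmark supporting prompts in over 200 natural languages, built with established machine translation methods and a quality assurance process, which significantly improves the evaluation of multilingual code generation capabilities. The final analysis reveals the current state of cross-lingual code generation and advances the field.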

mHumanEval -- A Multilingual Benchmark to Evaluate Large Language Models for Code Generation