We present MBXP, an execution-based code completion benchmark in 10+ programming languages. This collection of datasets is generated by our conversion framework that translates prompts and test cases from the original MBPP dataset to the corresponding data in a target language. Based on this benchmark, we are able to evaluate code generation models in a multi-lingual fashion, and in particular discover generalization ability of language models on out-of-domain languages, advantages of large multi-lingual models over mono-lingual, benefits of few-shot prompting, and zero-shot translation abilities. In addition, we use our code generation model to perform large-scale bootstrapping to obtain synthetic canonical solutions in several languages. These solutions can be used for other code-related evaluations such as insertion-based, summarization, or code translation tasks where we demonstrate results and release as part of our benchmark.

本文提出了新的基准测试，包括MBXP，Multilingual HumanEval和MathQA-X，以测试多语言环境下代码生成模型的性能，并发现了多语言模型的优势，以及通过few-shot prompting实现对模型新语言的教学能力和在单语言环境下的zero-shot translation能力。同时，作者还利用其代码生成模型在多种语言上实现了大规模引导过程，产生了其他与代码相关的评估任务中使用的合成规范解决方案。

代码生成模型的多语言评估