Large language models (LLMs) have made significant progress in generating
code from textual prompts. However, existing benchmarks have mainly
concentrated on translating English prompts to multilingual code or have been
constrained to a very limited set of natural languages (NLs). These benchmarks have
overlooked the vast landscape of massively multilingual NL to multilingual
code, leaving a critical gap in the evaluation of multilingual LLMs. In
response, we introduce HumanEval-XL, a massively multilingual code generation
benchmark specifically crafted to address this deficiency. HumanEval-XL
establishes connections between 23 NLs and 12 programming languages (PLs), and
comprises a collection of 22,080 prompts with an average of 8.33 test cases.
By ensuring parallel data across multiple NLs and PLs, HumanEval-XL offers a
comprehensive evaluation platform for multilingual LLMs, allowing us to
assess how well they understand different NLs. Our work serves as a
pioneering step towards filling the void in evaluating NL generalization in the
area of multilingual code generation. We make our evaluation code and data
publicly available at https://github.com/FloatAI/HumanEval-XL.
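
The benchmark's size can be sanity-checked from the figures quoted above. The sketch below assumes the 22,080 prompts are spread evenly over every NL-PL pair; the parallel design suggests this, but the abstract does not state it, and the variable names are illustrative only.

```python
# Figures taken from the abstract above.
NUM_NATURAL_LANGUAGES = 23
NUM_PROGRAMMING_LANGUAGES = 12
TOTAL_PROMPTS = 22_080

# Assumption: prompts are distributed evenly across all NL-PL pairs.
num_pairs = NUM_NATURAL_LANGUAGES * NUM_PROGRAMMING_LANGUAGES
prompts_per_pair, remainder = divmod(TOTAL_PROMPTS, num_pairs)

print(f"NL-PL pairs: {num_pairs}")
print(f"Prompts per pair: {prompts_per_pair} (remainder {remainder})")
```

Under that assumption the division is exact, which is consistent with a fully parallel set of prompts for each of the 276 NL-PL pairs.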

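HumanEval-XL, a massively multilingual code generation benchmark, fills the gap in evaluating natural language generalization in multilingual code generation.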

HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization

Research in massively multilingual image captioning has been severely
hampered by a lack of high-quality evaluation datasets. In this paper we
present the Crossmodal-3600 dataset (XM3600 for short), a geographically diverse
set of 3600 images annotated with human-generated reference captions in 36
languages. The images were selected from across the world, covering regions
where the 36 languages are spoken, and annotated with captions that achieve
consistency in terms of style across all languages, while avoiding annotation
artifacts due to direct translation. We apply this benchmark to model selection
for massively multilingual image captioning models, and show superior
correlation results with human evaluations when using XM3600 as golden
references for automatic metrics.

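This paper presents the Crossmodal-3600 dataset, which contains 3600 images covering the regions where 36 languages are spoken, annotated with human-generated reference captions. The dataset is applied to model selection for massively multilingual image captioning models, and shows higher correlation with human evaluations when XM3600 is used as the golden reference for automatic metrics.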

Crossmodal-3600: A Massively Multilingual Multimodal Evaluation Dataset

In this study, we tackle massively multilingual grapheme-to-phoneme (G2P)
conversion by implementing G2P models based on ByT5. We have curated a G2P
dataset from various sources that covers around 100 languages and trained
large-scale multilingual G2P models based on ByT5. We found that ByT5 operating
on byte-level inputs significantly outperformed the token-based mT5 model in
terms of multilingual G2P. Pairwise comparison with monolingual models in these
languages suggests that multilingual ByT5 models generally lower the phone
error rate by jointly learning from a variety of languages. The pretrained
model can further benefit low-resource G2P through zero-shot prediction on
unseen languages or by providing pretrained weights for finetuning, which helps the
model converge to a lower phone error rate than randomly initialized weights.
To facilitate future research on multilingual G2P, we make available our code
and pretrained multilingual G2P models at:
this https URL.
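
The phone error rate mentioned above is commonly computed as the total edit distance between predicted and reference phone sequences, divided by the total number of reference phones. The sketch below follows that common definition; the abstract does not spell out the authors' exact computation, and the function names and toy transcriptions are hypothetical, for illustration only.

```python
from typing import Sequence


def edit_distance(reference: Sequence[str], hypothesis: Sequence[str]) -> int:
    """Levenshtein distance between two phone sequences."""
    # previous[j] holds the distance between the reference prefix seen so far
    # and the first j phones of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_phone in enumerate(reference, start=1):
        current = [i]
        for j, hyp_phone in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_phone != hyp_phone)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def phone_error_rate(
    references: Sequence[Sequence[str]],
    hypotheses: Sequence[Sequence[str]],
) -> float:
    """Total edit distance divided by the total number of reference phones."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_phones = sum(len(ref) for ref in references)
    if total_phones == 0:
        raise ValueError("references contain no phones")
    total_errors = sum(
        edit_distance(ref, hyp) for ref, hyp in zip(references, hypotheses)
    )
    return total_errors / total_phones


# Toy transcriptions, invented for illustration (not data from the paper).
toy_references = [["k", "æ", "t"], ["d", "ɔ", "g"]]
toy_hypotheses = [["k", "æ", "t"], ["d", "o", "g", "z"]]
toy_per = phone_error_rate(toy_references, toy_hypotheses)
print(f"Toy phone error rate: {toy_per:.3f}")
```

In the toy example the second prediction contains one substitution and one insertion against six reference phones in total, so a lower value means predictions closer to the reference pronunciations.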

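Using ByT5, we curate a G2P dataset covering around 100 languages from various sources and train large-scale multilingual G2P models. Compared with monolingual models, multilingual ByT5 models lower the phone error rate by jointly learning from multiple languages, and can further help low-resource G2P through zero-shot prediction or finetuning.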