We present Belebele, a multiple-choice machine reading comprehension (MRC)
dataset spanning 122 language variants. Significantly expanding the language
coverage of natural language understanding (NLU) benchmarks, this dataset
enables the evaluation of text models in high-, medium-, and low-resource
languages. Each question is based on a short passage from the Flores-200
dataset and has four multiple-choice answers. The questions were carefully
curated to discriminate between models with different levels of general
language comprehension. The English dataset on its own proves difficult enough
to challenge state-of-the-art language models. Being fully parallel, this
dataset enables direct comparison of model performance across all languages. We
use this dataset to evaluate the capabilities of multilingual masked language
models (MLMs) and large language models (LLMs). We present extensive results
and find that despite significant cross-lingual transfer in English-centric
LLMs, much smaller MLMs pretrained on balanced multilingual data still
understand far more languages. We also observe that larger vocabulary size and
conscious vocabulary construction correlate with better performance on
low-resource languages. Overall, Belebele opens up new avenues for evaluating
and analyzing the multilingual capabilities of NLP systems.
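
Because the dataset is fully parallel and every question has four answer choices, per-language accuracy can be compared directly against a shared 25% chance baseline. The following is a minimal sketch of that comparison, not the authors' evaluation code. The record fields (`lang`, `predicted`, `correct`), the helper name `accuracy_by_language`, and the toy predictions are all assumptions made for illustration.

```python
from collections import defaultdict

CHANCE_ACCURACY = 0.25  # four answer choices per question


def accuracy_by_language(records):
    """Return {language: accuracy} for multiple-choice predictions.

    Each record is a dict with hypothetical keys:
      'lang'      - language code of the passage/question
      'predicted' - index of the answer the model chose
      'correct'   - index of the gold answer
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        totals[record["lang"]] += 1
        hits[record["lang"]] += int(record["predicted"] == record["correct"])
    return {lang: hits[lang] / totals[lang] for lang in totals}


# Toy, made-up predictions on the same two questions in two languages.
records = [
    {"lang": "eng_Latn", "predicted": 2, "correct": 2},
    {"lang": "eng_Latn", "predicted": 4, "correct": 4},
    {"lang": "swh_Latn", "predicted": 2, "correct": 2},
    {"lang": "swh_Latn", "predicted": 1, "correct": 4},
]

scores = accuracy_by_language(records)
for lang, acc in sorted(scores.items()):
    print(f"{lang}: {acc:.2f} (chance = {CHANCE_ACCURACY:.2f})")
```

Since the questions are identical across languages, any gap between two languages in such a table reflects the model rather than a difference in question difficulty.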

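We present Belebele, a multiple-choice machine reading comprehension (MRC) dataset covering 122 language variants. The dataset significantly expands the language coverage of natural language understanding (NLU) benchmarks and enables the evaluation of text models in high-, medium-, and low-resource languages, making direct comparison of model performance across languages possible. Using this dataset, we evaluate the capabilities of multilingual masked language models (MLMs) and large language models (LLMs) and draw several conclusions.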

The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants

Recent work has demonstrated the effectiveness of cross-lingual language
model pretraining for cross-lingual understanding. In this study, we present
the results of two larger multilingual masked language models, with 3.5B and
10.7B parameters. Our two new models, dubbed XLM-R XL and XLM-R XXL, outperform
XLM-R by 1.8% and 2.4% average accuracy on XNLI. Our model also outperforms the
RoBERTa-Large model on several English tasks of the GLUE benchmark by 0.3% on
average while handling 99 more languages. This suggests that pretrained models
with larger capacity may obtain strong performance on high-resource languages
while greatly improving performance on low-resource languages. We make our code and models
publicly available.

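This study examines the effectiveness of cross-lingual language model pretraining and presents two larger multilingual masked language models with 3.5B and 10.7B parameters. The two new models, XLM-R XL and XLM-R XXL, outperform XLM-R by 1.8% and 2.4% average accuracy on XNLI. The work also reports outperforming the RoBERTa-Large model on several English GLUE tasks by 0.3% on average while handling 99 more languages, which suggests that higher-capacity pretrained models can achieve strong performance on high-resource languages while greatly improving low-resource ones.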

Larger-Scale Transformers for Multilingual Masked Language Modeling

The paper introduces methods for adapting multilingual masked language
models to a specific language. Pre-trained bidirectional language models show
state-of-the-art performance on a wide range of tasks including reading
comprehension, natural language inference, and sentiment analysis. At the
moment there are two alternative approaches to training such models: monolingual
and multilingual. While language-specific models show superior performance,
multilingual models allow transfer from one language to another
and can solve tasks for different languages simultaneously. This work shows that
transfer learning from a multilingual model to a monolingual model yields
significant performance gains on tasks such as reading comprehension,
paraphrase detection, and sentiment analysis. Furthermore, multilingual
initialization of a monolingual model substantially reduces training time.
Pre-trained models for the Russian language are open-sourced.
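
One way to picture "multilingual initialization of a monolingual model" is at the embedding layer: tokens shared by both vocabularies reuse the multilingual vector, and new tokens start from the mean of the multilingual pieces they split into. The abstract does not specify the procedure, so the sketch below is an assumption for illustration only. The function `init_monolingual_embeddings`, the `tokenize` callback, the random fallback, and the toy vocabularies are all hypothetical.

```python
import random


def init_monolingual_embeddings(multi_vocab, multi_emb, mono_vocab, tokenize, dim, seed=0):
    """Build an embedding table for a monolingual vocabulary (illustrative only).

    multi_vocab: dict mapping multilingual token -> row index in multi_emb
    multi_emb:   list of embedding vectors (lists of floats) of length dim
    mono_vocab:  list of tokens in the new monolingual vocabulary
    tokenize:    hypothetical callback splitting a token into multilingual pieces
    """
    rng = random.Random(seed)
    mono_emb = []
    for token in mono_vocab:
        if token in multi_vocab:
            # Shared token: copy the multilingual vector unchanged.
            mono_emb.append(list(multi_emb[multi_vocab[token]]))
            continue
        pieces = [p for p in tokenize(token) if p in multi_vocab]
        if pieces:
            # New token: average the vectors of its multilingual pieces.
            vectors = [multi_emb[multi_vocab[p]] for p in pieces]
            mono_emb.append([sum(col) / len(vectors) for col in zip(*vectors)])
        else:
            # No usable pieces: fall back to small random values.
            mono_emb.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return mono_emb


# Toy vocabularies and vectors, made up for the example.
multi_vocab = {"при": 0, "вет": 1, "мир": 2}
multi_emb = [[1.0, 0.0], [0.0, 1.0], [0.2, 0.8]]
mono_vocab = ["мир", "привет", "ъъ"]
splits = {"привет": ["при", "вет"]}

mono_emb = init_monolingual_embeddings(
    multi_vocab, multi_emb, mono_vocab, lambda t: splits.get(t, []), dim=2
)
print(mono_emb[0], mono_emb[1])
```

Starting from vectors that already encode useful information, rather than from random values, is one plausible reason such initialization could shorten training; the remaining Transformer weights would simply be copied from the multilingual checkpoint.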

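The paper introduces methods for adapting multilingual masked language models to a specific language. It shows that transfer learning from a multilingual model to a monolingual model can significantly improve performance on tasks such as reading comprehension and sentiment analysis, and that multilingual initialization of a monolingual model can substantially reduce training time. Pre-trained models for Russian have been released publicly.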