This study investigates the factors influencing the performance of
multilingual large language models (MLLMs) across diverse languages. We study 6
MLLMs, including masked language models, autoregressive models, and
instruction-tuned LLMs, on the SIB-200 dataset, a topic classification dataset
encompassing 204 languages. Our analysis considers three scenarios: ALL
languages, SEEN languages (present in the model's pretraining data), and UNSEEN
languages (not present or documented in the model's pretraining data in any
meaningful way). We examine the impact of factors such as pretraining data
size, general resource availability, language family, and script type on model
performance. Decision tree analysis reveals that pretraining data size is the
most influential factor for SEEN languages. For UNSEEN languages, by contrast,
script type and language family are crucial, highlighting the
importance of cross-lingual transfer learning. Notably, model size and
architecture do not significantly alter the most important features identified.
Our findings provide valuable insights into the strengths and limitations of
current MLLMs, and we hope they will guide the development of more effective
and equitable multilingual NLP systems.
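
As a rough illustration of how a tree-based analysis can rank factors such as
pretraining data size or script type, the sketch below scores each feature by
the largest variance reduction that a single threshold split (a depth-one
regression tree) achieves on a performance score. It is a simplified stand-in,
not the study's actual procedure: the abstract does not specify the tree
configuration, the helper names `best_split_gain` and `rank_features` are
hypothetical, and the toy rows are invented for illustration only and are not
results from the paper.

```python
from statistics import pvariance


def best_split_gain(xs, ys):
    """Fraction of the target's variance removed by the best single
    threshold split on one numeric feature (a decision stump)."""
    n = len(ys)
    total = pvariance(ys) * n  # total sum of squared deviations
    if total == 0:
        return 0.0
    pairs = sorted(zip(xs, ys))
    best = 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        within = pvariance(left) * len(left) + pvariance(right) * len(right)
        best = max(best, total - within)
    return best / total


def rank_features(rows, feature_names, target):
    """Sort features by their best single-split variance reduction."""
    ys = [row[target] for row in rows]
    gains = {
        name: best_split_gain([row[name] for row in rows], ys)
        for name in feature_names
    }
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Toy, made-up rows (NOT data from the study): one row per language.
# Categorical factors such as script type are assumed to be encoded as
# 0/1 indicators so that a threshold split applies to them.
rows = [
    {"pretrain_size": 9.0, "latin_script": 1, "f1": 0.85},
    {"pretrain_size": 8.0, "latin_script": 0, "f1": 0.80},
    {"pretrain_size": 3.0, "latin_script": 1, "f1": 0.45},
    {"pretrain_size": 2.0, "latin_script": 0, "f1": 0.40},
]

ranking = rank_features(rows, ["pretrain_size", "latin_script"], "f1")
for name, gain in ranking:
    print(f"{name}: {gain:.3f}")
```

Running the same ranking separately on the SEEN and UNSEEN subsets of languages
would mirror the three-scenario comparison described above. A full decision
tree additionally captures interactions between features, which this
single-split score ignores.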

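Summary: This work studies how multilingual large language models (MLLMs) perform across 204 languages, examining the effect of pretraining data size, resource availability, language family, and script type on model performance. For seen languages, pretraining data size is the most important factor, whereas for unseen languages, script type and language family are crucial. Model size and architecture do not significantly change which features matter most. These findings offer valuable insight into the strengths and limitations of current MLLMs and are intended to guide the development of more effective and equitable multilingual NLP systems.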

What Drives Performance in Multilingual Language Models?

Transformer-based pretrained language models achieve outstanding results in
many well-known NLU benchmarks. However, while pretraining methods are very
convenient, they are expensive in terms of time and resources. This calls for a
study of the impact of pretraining data size on the knowledge of the models. We
explore this impact on the syntactic capabilities of RoBERTa, using models
trained on incremental sizes of raw text data. First, we use syntactic
structural probes to determine whether models pretrained on more data encode a
higher amount of syntactic information. Second, we perform a targeted syntactic
evaluation to analyze the impact of pretraining data size on the syntactic
generalization performance of the models. Third, we compare the performance of
the different models on three downstream applications: part-of-speech tagging,
dependency parsing and paraphrase identification. We complement our study with
an analysis of the cost-benefit trade-off of training such models. Our
experiments show that while models pretrained on more data encode more
syntactic knowledge and perform better on downstream applications, they do not
always perform better across the different syntactic phenomena, and they come
at a higher financial and environmental cost.

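Summary: This study examines the effect of pretraining data size on the syntactic capabilities of RoBERTa models and on their performance in downstream applications, and analyzes the cost-benefit trade-off of training such models. The results show that although increasing the pretraining data size significantly improves the models' syntactic capabilities and downstream task performance, it also incurs higher financial and environmental costs.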