This study investigates the factors that influence the performance of multilingual large language models (MLLMs) across diverse languages. We evaluate 6 MLLMs, including masked language models, autoregressive models, and instruction-tuned LLMs, on SIB-200, a topic classification dataset covering 204 languages. Our analysis considers three scenarios: ALL languages, SEEN languages (present in the model's pretraining data), and UNSEEN languages (absent from, or not meaningfully documented in, the model's pretraining data). We examine the impact of pretraining data size, general resource availability, language family, and script type on model performance. Decision tree analysis reveals that pretraining data size is the most influential factor for SEEN languages. For UNSEEN languages, by contrast, script type and language family are the crucial factors, which highlights the importance of cross-lingual transfer learning. Notably, model size and architecture do not significantly alter which features are identified as most important. Our findings provide insight into the strengths and limitations of current MLLMs, and we hope they will guide the development of more effective and equitable multilingual NLP systems.
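To make the idea of ranking factors with a decision tree concrete, the following is a minimal sketch, not the authors' procedure. It scores each factor by the error reduction of its single best binary split, which is how a regression tree chooses its root node. The rows, feature names, and accuracy values are invented for illustration and are not SIB-200 results.

```python
# Toy rows, invented for illustration only (NOT results from the study).
# Each row: (log10 pretraining tokens, script, language family, accuracy)
ROWS = [
    (9.0, "Latin", "Indo-European", 0.88),
    (8.5, "Latin", "Indo-European", 0.85),
    (8.0, "Cyrillic", "Turkic", 0.80),
    (6.0, "Latin", "Niger-Congo", 0.55),
    (5.5, "Ethiopic", "Afro-Asiatic", 0.45),
    (5.0, "Latin", "Austronesian", 0.50),
]
# Hypothetical feature names, in column order.
FEATURES = ["pretrain_size", "script", "family"]


def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split_gain(rows, col):
    """Largest SSE reduction achievable with one binary split on column `col`.

    Numeric columns are split by threshold (<= value); string columns are
    split one-vs-rest (== value).
    """
    base = sse([r[-1] for r in rows])
    best = 0.0
    for value in {r[col] for r in rows}:
        if isinstance(value, str):
            left = [r[-1] for r in rows if r[col] == value]
            right = [r[-1] for r in rows if r[col] != value]
        else:
            left = [r[-1] for r in rows if r[col] <= value]
            right = [r[-1] for r in rows if r[col] > value]
        if not left or not right:
            continue  # a split must send rows to both sides
        best = max(best, base - sse(left) - sse(right))
    return best


def rank_features(rows):
    """Feature names ordered from most to least useful root split."""
    gains = {name: best_split_gain(rows, i) for i, name in enumerate(FEATURES)}
    return sorted(gains, key=gains.get, reverse=True)


if __name__ == "__main__":
    print(rank_features(ROWS))
```

A full decision tree would apply the same search recursively and accumulate the gains per feature; the single-split version is used here only to keep the sketch short.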

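By studying the performance of multilingual large language models (MLLMs) across 204 languages, this work examines how pretraining data size, resource availability, language family, and script type affect model performance. It finds that pretraining data size is the most important factor for seen languages, whereas script type and language family are crucial for unseen languages. Model size and architecture do not significantly change which features matter most. These findings offer valuable insight into the strengths and limitations of current MLLMs and are intended to guide the development of more effective and equitable multilingual NLP systems.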

What drives the performance of multilingual language models?