Pretrained multilingual language models can help bridge the digital language divide, enabling high-quality NLP models for lower-resourced languages. Studies of multilingual models have so far focused on performance, consistency, and cross-lingual generalisation. However, given their widespread application in the wild and their downstream societal impact, it is important to put multilingual models under the same scrutiny as monolingual models. This work investigates the group fairness of multilingual models, asking whether these models are equally fair across languages. To this end, we create a new four-way multilingual dataset of parallel cloze test examples (MozArt), equipped with demographic information (balanced with regard to gender and native tongue) about the test participants. We evaluate three multilingual models on MozArt -- mBERT, XLM-R, and mT5 -- and show that across the four target languages, the three models exhibit different levels of group disparity, e.g., near-equal risk for Spanish but high levels of disparity for German.

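This work examines the group fairness of pretrained multilingual language models. By creating a new multilingual dataset of parallel cloze test examples (MozArt) and using demographic information to evaluate three multilingual models (mBERT, XLM-R, and mT5), we find that the three models exhibit different levels of group disparity across the four target languages, e.g., near-equal risk for Spanish but high levels of disparity for German.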

Are pretrained multilingual models equally fair across languages?
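
One way to make the notion of group disparity concrete is sketched below. This is a minimal illustration, not necessarily the metric used in this work: it assumes disparity is measured as the gap between the best and worst per-group accuracy within each language. The function name `group_disparity`, the record layout, and the toy records are all hypothetical and carry no empirical meaning.

```python
from collections import defaultdict


def group_disparity(records):
    """Return {language: max-min gap in per-group accuracy}.

    records: iterable of (language, group, correct) tuples, where
    `correct` is True if the model's cloze prediction was right.
    Assumption: disparity is the max-min accuracy gap across groups.
    """
    # Collect 0/1 outcomes per language and demographic group.
    hits = defaultdict(lambda: defaultdict(list))
    for language, group, correct in records:
        hits[language][group].append(1.0 if correct else 0.0)

    disparity = {}
    for language, groups in hits.items():
        accuracies = [sum(v) / len(v) for v in groups.values()]
        disparity[language] = max(accuracies) - min(accuracies)
    return disparity


# Made-up toy records, for illustration only (not data from MozArt).
toy = [
    ("lang_a", "group_1", True), ("lang_a", "group_1", False),
    ("lang_a", "group_2", True), ("lang_a", "group_2", False),
    ("lang_b", "group_1", True), ("lang_b", "group_1", True),
    ("lang_b", "group_2", True), ("lang_b", "group_2", False),
]
result = group_disparity(toy)
print(result)
```

Under this assumed definition, a value near zero for a language means the groups are served about equally well, while a larger value signals that one group fares noticeably worse than another.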