Multilingual language models such as mBERT have shown impressive cross-lingual transfer to a variety of languages, but many languages remain excluded from these models. In this paper, we analyse the effect of pre-training with monolingual data for a low-resource language that is not included in mBERT -- Maltese -- with a range of pre-training setups. We evaluate the newly pre-trained models on three morphosyntactic tasks -- dependency parsing, part-of-speech tagging, and named-entity recognition -- and one semantic classification task -- sentiment analysis. We also present a newly created corpus for Maltese, and determine the effect that the pre-training data size and domain have on downstream performance. Our results show that using a mixture of pre-training domains is often superior to using Wikipedia text only. We also find that a fraction of this corpus is enough to make significant leaps in performance over Wikipedia-trained models. We pre-train and compare two models on the new corpus: a monolingual BERT model trained from scratch (BERTu), and a further pre-trained multilingual BERT (mBERTu). The models achieve state-of-the-art performance on these tasks, despite the new corpus being considerably smaller than the corpora typically used for high-resource languages. On average, BERTu outperforms or performs competitively with mBERTu, and the largest gains are observed for higher-level tasks.

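This paper analyses the effect of pre-training with monolingual data for Maltese, a low-resource language not included in mBERT, and studies how the size and domain of a new Maltese corpus affect downstream task performance. The results show that using a mixture of pre-training domains is often superior to using Wikipedia text only, and that a fraction of the Maltese corpus is enough to yield significant gains in task performance. The paper also pre-trains and compares two models: a monolingual BERT model trained from scratch (BERTu) and a further pre-trained multilingual BERT model (mBERTu), both of which achieve state-of-the-art performance on a range of downstream tasks.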

Pre-training Data Quality and Quantity for a Low-Resource Language: New Corpus and BERT Models for Maltese