This research provides the first comprehensive analysis of the performance of pre-trained language models for Sinhala text classification. We test on a set of different Sinhala text classification tasks and our analysis shows that out of the pre-trained multilingual models that include Sinhala (XLM-R, LaBSE, and LASER), XLM-R is the best model by far for Sinhala text classification. We also pre-train two RoBERTa-based monolingual Sinhala models, which are far superior to the existing pre-trained language models for Sinhala. We show that when fine-tuned, these pre-trained language models set a very strong baseline for Sinhala text classification and are robust in situations where labeled data is insufficient for fine-tuning. We further provide a set of recommendations for using pre-trained models for Sinhala text classification. We also introduce new annotated datasets useful for future research in Sinhala text classification and publicly release our pre-trained models.

BERTifying Sinhala: A Comprehensive Analysis of Pre-trained Language Models for Sinhala Text Classification