The world of language models is going through turbulent times: better and ever larger models are coming out at an unprecedented speed. However, we argue that, especially for the scientific community, encoder models of up to 1 billion parameters are still very much needed, their primary use being to enrich large collections of data with the metadata necessary for downstream research. We investigate the best way to ensure the existence of such encoder models for a set of very closely related languages (Croatian, Serbian, Bosnian and Montenegrin) by setting up a diverse benchmark for these languages and comparing models trained from scratch with new models constructed via additional pretraining of existing multilingual models. We show that performance comparable to that of dedicated from-scratch models can be obtained by additionally pretraining available multilingual models, even with a limited amount of computation. We also show that neighboring languages, in our case Slovenian, can be included in the additional pretraining with little to no loss in the performance of the final model.

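We investigate the best way to ensure that encoder models of up to 1 billion parameters exist for a set of closely related languages (Croatian, Serbian, Bosnian and Montenegrin) by additionally pretraining existing multilingual models. The results show that, even with limited computation, additional pretraining achieves performance comparable to that of models developed from scratch, and that including a neighboring language such as Slovenian in the additional pretraining has almost no effect on the performance of the final model.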

Language Models on a Diet: Cost-Efficient Development of Encoders for Closely-Related Languages via Additional Pretraining