Domain adaptive pretraining, i.e. the continued unsupervised pretraining of a language model on domain-specific text, improves the modelling of text for downstream tasks within the domain. Numerous real-world applications are based on domain-specific text, e.g. working with financial or biomedical documents, and these applications often need to support multiple languages. However, large-scale domain-specific multilingual pretraining data for such scenarios can be difficult to obtain, due to regulations, legislation, or simply a lack of language- and domain-specific text. One solution is to train a single multilingual model, taking advantage of the data available in as many languages as possible. In this work, we explore the benefits of domain adaptive pretraining with a focus on adapting to multiple languages within a specific domain. We propose different techniques to compose pretraining corpora that enable a language model to become both domain-specific and multilingual. Evaluation on nine domain-specific datasets (for biomedical named entity recognition and financial sentence classification) covering seven different languages shows that a single multilingual domain-specific model can outperform the general multilingual model and perform close to its monolingual counterpart. This finding holds across two different pretraining methods: adapter-based pretraining and full model pretraining.

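This work investigates domain adaptive pretraining combined with multilingual corpora to train a single language model that is both domain-specific and multilingual, thereby improving the modelling of text for tasks in different languages within the target domain. The results show that, when evaluated on several domain-specific datasets covering tasks such as biomedical named entity recognition and financial sentence classification, such a model can outperform a general multilingual model and performs close to its monolingual counterparts.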

MDAPT: Multilingual Domain Adaptive Pretraining in a Single Model