This paper considers the unsupervised domain adaptation problem for neural machine translation (NMT), where we assume access to only monolingual text, in either the source or the target language, in the new domain. We propose a cross-lingual data selection method that extracts in-domain sentences on the missing language side from a large generic monolingual corpus. The method trains an adaptive layer on top of multilingual BERT with contrastive learning to align the representations of the source and target languages. This alignment makes the domain classifier transferable between the languages in a zero-shot manner. Once the classifier has identified the in-domain data, the NMT model is adapted to the new domain by jointly learning the translation and domain discrimination tasks. We evaluate our cross-lingual data selection method on NMT across five diverse domains in three language pairs, as well as in a real-world scenario of translation for COVID-19. The results show that our method outperforms other selection baselines by up to +1.5 BLEU.
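
To make the alignment step concrete, the following is a minimal sketch of a contrastive objective over paired sentence representations. It is an illustration under stated assumptions, not the implementation used in this work: the abstract does not specify the exact loss, so an InfoNCE-style loss with in-batch negatives is assumed here, and the names `cosine`, `contrastive_loss`, and `temperature` are hypothetical. The input vectors stand in for the outputs of the adaptive layer on top of multilingual BERT.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(src_vecs, tgt_vecs, temperature=0.1):
    """Assumed InfoNCE-style loss with in-batch negatives.

    src_vecs[i] and tgt_vecs[i] represent a sentence and its translation;
    every other target vector in the batch acts as a negative example.
    """
    n = len(src_vecs)
    total = 0.0
    for i in range(n):
        logits = [cosine(src_vecs[i], tgt_vecs[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates in the batch.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        # Negative log-probability of picking the true translation.
        total += log_denominator - logits[i]
    return total / n


# Toy batch: two source sentences with matching and mismatched targets.
source = [[1.0, 0.0], [0.0, 1.0]]
aligned_targets = [[1.0, 0.0], [0.0, 1.0]]
swapped_targets = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = contrastive_loss(source, aligned_targets)
loss_swapped = contrastive_loss(source, swapped_targets)
```

In this toy batch the loss is lower when each source vector is closest to its own translation, which is the property that lets a domain classifier trained on one language be applied to the other.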

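This paper addresses unsupervised domain adaptation for neural machine translation and proposes a cross-lingual data selection method. Contrastive learning on top of multilingual BERT aligns the representations of the source and target languages, which makes the domain classifier transferable in a zero-shot manner, and the model is adapted to the new domain by jointly learning the translation and domain discrimination tasks. We evaluate the cross-lingual data selection method on neural machine translation across five different domains and three language pairs, and validate it on translation for COVID-19. The experimental results show that the proposed method improves over baseline methods by up to 1.5 BLEU.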

Generalised Unsupervised Domain Adaptation of Neural Machine Translation with Cross-Lingual Data Selection