Natural language understanding and generation models follow one of two dominant architectural paradigms: language models (LMs), which process concatenated sequences in a single stack of layers, and encoder-decoder models (EncDec), which use separate layer stacks for input and output processing. In machine translation, EncDec has long been the favoured approach, and few studies have investigated the performance of LMs. In this work, we thoroughly examine the impact of several architectural design choices on the performance of LMs on bilingual, (massively) multilingual and zero-shot translation tasks, under systematic variations of data conditions and model sizes. Our results show that: (i) different LMs have different scaling properties, where architectural differences often have a significant impact on model performance at small scales, but the performance gap narrows as the number of parameters increases; (ii) several design choices, including causal masking and language-modeling objectives for the source sequence, have detrimental effects on translation quality; and (iii) when paired with full-visible masking for source sequences, LMs can perform on par with EncDec on supervised bilingual and multilingual translation tasks, and improve greatly on zero-shot directions by reducing off-target translations.
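
The difference between causal masking and full-visible masking of the source can be illustrated with attention masks over a concatenated source-plus-target sequence. The following is a minimal sketch, not the authors' implementation; the function names `causal_mask` and `prefix_mask` and the boolean-matrix representation are assumptions made for illustration.

```python
def causal_mask(src_len, tgt_len):
    """Causal masking: every position attends only to itself and earlier positions.

    Returns an n x n boolean matrix (n = src_len + tgt_len) where
    mask[i][j] is True when position i may attend to position j.
    """
    n = src_len + tgt_len
    return [[j <= i for j in range(n)] for i in range(n)]


def prefix_mask(src_len, tgt_len):
    """Full-visible masking on the source, causal masking on the target.

    Source positions attend to the entire source; target positions attend
    to the entire source plus earlier (and current) target positions.
    """
    n = src_len + tgt_len
    return [[j < src_len or j <= i for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    # Hypothetical lengths: 3 source tokens followed by 2 target tokens.
    for name, mask in (("causal", causal_mask(3, 2)), ("prefix", prefix_mask(3, 2))):
        print(name)
        for row in mask:
            print(" ".join("1" if allowed else "0" for allowed in row))
```

The two masks differ only in the source rows: under `prefix_mask`, each source token sees the whole source sentence, while the target rows remain causal in both cases.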

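This paper examines how language models and encoder-decoder models perform in machine translation. The results show that language models lag behind at small scales, but their performance approaches that of encoder-decoder models as the number of parameters increases; that language-modeling objectives and causal masking harm translation quality; and that, when paired with full-visible masking, language models match encoder-decoder models on supervised bilingual and multilingual translation tasks and perform better on zero-shot translation directions.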

Examining the scaling and transfer of language model architectures for machine translation