This paper introduces a new data augmentation method for neural machine translation that enforces stronger semantic consistency both within and across languages. Our method is based on a Conditional Masked Language Model (CMLM), which is bidirectional and can condition on both the left and right context as well as the label. We demonstrate that CMLM is an effective technique for generating context-dependent word distributions. In particular, we show that CMLM can enforce semantic consistency by conditioning on both source and target during substitution. In addition, to enhance diversity, we incorporate soft word substitution for data augmentation, which replaces a word with a probability distribution over the vocabulary. Experiments on four translation datasets of different scales show that the overall solution yields more realistic data augmentation and better translation quality. Our approach consistently achieves the best performance compared with strong recent methods and improves over the baseline by up to 1.90 BLEU points.
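
To make the idea of soft word substitution concrete, the following is a minimal sketch, not the authors' implementation. It assumes that "replacing a word with a distribution over the vocabulary" is realized as the expected embedding under that distribution; the toy embedding table, the logits, and the helper names `softmax` and `soft_embedding` are all hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_embedding(probs, embedding_table):
    """Return the probability-weighted average of all word embeddings.

    Assumption: the soft substitute for a word is the expectation of the
    embedding table under the predicted distribution over the vocabulary.
    """
    dim = len(embedding_table[0])
    return [
        sum(p * row[d] for p, row in zip(probs, embedding_table))
        for d in range(dim)
    ]


# Toy vocabulary of 3 words with 2-dimensional embeddings (hypothetical values).
embedding_table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical scores that a conditional masked language model might assign
# to each vocabulary word at the masked position.
logits = [2.0, 1.0, 0.0]

probs = softmax(logits)
soft_vec = soft_embedding(probs, embedding_table)
```

In this sketch the vector `soft_vec` would be fed to the translation model in place of the original word's embedding, so the augmented input blends every candidate word in proportion to its predicted probability instead of committing to a single hard replacement.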

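This paper introduces a new data augmentation method for neural machine translation that enforces stronger semantic consistency both within and across languages. The results show that the conditional masked language model is an effective technique for generating context-dependent word distributions, and the method incorporates soft word substitution to enhance data diversity and strengthen semantic consistency. Experiments on four translation datasets of different scales show more realistic data augmentation and better translation quality. Compared with strong recent work, our method consistently achieves the best performance and improves over the baseline by up to 1.90 BLEU points.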

Semantically Consistent Data Augmentation for Neural Machine Translation Based on a Conditional Masked Language Model