Large-scale training datasets lie at the core of the recent success of neural machine translation (NMT) models. However, the complex patterns and potential noise in large-scale data make training NMT models difficult. In this work, we identify inactive training examples, which contribute less to model performance, and show that the existence of inactive examples depends on the data distribution. We further introduce data rejuvenation, which improves the training of NMT models on large-scale datasets by exploiting inactive examples. The proposed framework consists of three phases. First, we train an identification model on the original training data and use it to separate inactive examples from active examples by their sentence-level output probabilities. Then, we train a rejuvenation model on the active examples and use it to re-label the inactive examples via forward-translation. Finally, the rejuvenated examples and the active examples are combined to train the final NMT model. Experimental results on the WMT14 English-German and English-French datasets show that the proposed data rejuvenation consistently and significantly improves performance for several strong NMT models. Extensive analyses reveal that our approach stabilizes and accelerates the training process of NMT models, resulting in final models with better generalization capability.
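
The three phases described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `train` callable, the model methods `score` and `translate`, and the `inactive_fraction` parameter are all hypothetical names introduced here. The sketch also assumes that the examples with the lowest sentence-level probability are the ones treated as inactive, and that the split is made by a fixed fraction; the abstract does not specify either detail.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def split_active_inactive(
    pairs: List[Pair],
    score: Callable[[str, str], float],
    inactive_fraction: float = 0.1,  # assumption: not stated in the abstract
) -> Tuple[List[Pair], List[Pair]]:
    """Rank pairs by sentence-level score and split off the lowest-scoring part.

    Assumption: a lower score (probability) marks an example as inactive.
    Returns (active, inactive).
    """
    ranked = sorted(pairs, key=lambda p: score(p[0], p[1]))
    cut = int(len(ranked) * inactive_fraction)
    return ranked[cut:], ranked[:cut]


def data_rejuvenation(
    pairs: List[Pair],
    train: Callable[[List[Pair]], object],  # hypothetical trainer: pairs -> model
    inactive_fraction: float = 0.1,
):
    """Run the three phases; the model API (score/translate) is hypothetical."""
    # Phase 1: train an identification model on the original data and use its
    # sentence-level scores to separate active from inactive examples.
    identifier = train(pairs)
    active, inactive = split_active_inactive(
        pairs, identifier.score, inactive_fraction
    )

    # Phase 2: train a rejuvenation model on the active examples only, then
    # re-label each inactive source sentence by forward-translation.
    rejuvenator = train(active)
    rejuvenated = [(src, rejuvenator.translate(src)) for src, _ in inactive]

    # Phase 3: train the final model on active plus rejuvenated examples.
    return train(active + rejuvenated)
```

In practice `train` would wrap a full NMT training run and `score` would return the model's probability (or log-probability) of the reference translation given the source; here both are left abstract so that only the data flow between the three phases is shown.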

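This paper introduces data rejuvenation, a method for improving the training of neural machine translation models on large-scale datasets. An identification model is first trained to identify inactive examples, a rejuvenation model then re-labels those examples, and finally the rejuvenated examples are combined with the active examples to train the final NMT model. Experimental results show that the method significantly improves model performance, especially on large datasets.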

Data Rejuvenation: Exploiting Inactive Training Examples for Neural Machine Translation