We present an empirical study of the scaling properties of encoder-decoder Transformer models used in neural machine translation (NMT). We show that cross-entropy loss as a function of model size follows a scaling law. Specifically: (i) We propose a formula that describes the scaling behavior of cross-entropy loss as a bivariate function of encoder and decoder size, and show that it gives accurate predictions under a variety of scaling approaches and languages; we also show that the total number of parameters alone is not sufficient for this purpose. (ii) We observe different power-law exponents when scaling the decoder versus the encoder, and provide recommendations for the optimal allocation of encoder/decoder capacity based on this observation. (iii) We report that the scaling behavior of the model is acutely influenced by composition bias of the train/test sets, which we define as any deviation from naturally generated text (whether machine-generated or human-translated text). We observe that natural text on the target side benefits from scaling, which manifests as a successful reduction of the cross-entropy loss. (iv) Finally, we investigate the relationship between the cross-entropy loss and the quality of the generated translations. We find two different behaviors, depending on the nature of the test data. For test sets that were originally translated from the target language to the source language, both loss and BLEU score improve as model size increases. In contrast, for test sets originally translated from the source language to the target language, the loss improves, but the BLEU score stops improving after a certain threshold. We release generated text from all models used in this study.
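
The abstract does not state the bivariate formula itself, so the following is only a minimal sketch under an assumed functional form: a product of two power laws, one in encoder size and one in decoder size, plus an irreducible loss term. The function name `bivariate_loss`, the reference sizes, and every coefficient value are hypothetical and chosen for illustration; they are not fitted results from the study. The sketch shows why total parameter count alone cannot determine the loss when the two exponents differ.

```python
def bivariate_loss(n_enc, n_dec, *, alpha, p_enc, p_dec, l_inf,
                   n_enc_ref=1.0, n_dec_ref=1.0):
    """Predicted cross-entropy loss for given encoder/decoder parameter counts.

    Assumed form (an illustration, not quoted from the abstract):
        L = alpha * (n_enc_ref / n_enc)**p_enc * (n_dec_ref / n_dec)**p_dec + l_inf
    """
    if n_enc <= 0 or n_dec <= 0:
        raise ValueError("parameter counts must be positive")
    reducible = alpha * (n_enc_ref / n_enc) ** p_enc * (n_dec_ref / n_dec) ** p_dec
    return reducible + l_inf


# Hypothetical coefficients, chosen only to make the exponents differ.
PARAMS = dict(alpha=1.0, p_enc=0.2, p_dec=0.3, l_inf=0.5,
              n_enc_ref=1e8, n_dec_ref=1e8)

# Two models with the same total size (4e8 parameters) but different splits.
encoder_heavy = bivariate_loss(3e8, 1e8, **PARAMS)
decoder_heavy = bivariate_loss(1e8, 3e8, **PARAMS)

print(f"encoder-heavy split: {encoder_heavy:.4f}")
print(f"decoder-heavy split: {decoder_heavy:.4f}")
```

With these made-up exponents the two splits yield different predicted losses at identical total size, which is the qualitative point behind modeling encoder and decoder size separately.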

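Through the empirical study presented in this paper, we characterize the scaling properties of encoder-decoder Transformer models in neural machine translation. Specifically, we propose a formula that describes the relationship between cross-entropy loss and the factors by which the encoder and decoder are scaled, and show that its estimates are accurate across a variety of scaling approaches and languages. We also observe that scaling the encoder and scaling the decoder have different effects, and on this basis provide recommendations for the optimal allocation of encoder/decoder capacity. We further find that the scaling behavior of the model is strongly affected by bias in the composition of the train/test sets, which we call "composition bias"; this bias matters greatly for reducing cross-entropy loss. Finally, we investigate the relationship between cross-entropy loss and the quality of the generated translations, and find that changes in model size affect translation quality differently depending on which language the test data was originally translated from. The generated text from all models used in this study is released publicly.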

Scaling Laws for Neural Machine Translation