Many document-level neural machine translation (NMT) systems have explored
the utility of context-aware architectures, usually at the cost of more
parameters and higher computational complexity. However, little attention has
been paid to the baseline model. In this paper, we extensively study the pros and
cons of the standard transformer in document-level translation, and find that
its auto-regressive property brings both the advantage of consistency and the
disadvantage of error accumulation. Therefore, we propose a
surprisingly simple long-short term masking self-attention on top of the
standard transformer to both capture long-range dependencies effectively and
reduce the propagation of errors. We evaluate our approach on two publicly
available document-level datasets, where it achieves strong BLEU scores and
captures discourse phenomena.
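
The abstract does not spell out how the long-short term masks are built, so the following is only a minimal sketch under stated assumptions: a "short-term" mask is taken to restrict each token to tokens of its own sentence, and a "long-term" mask is taken to let each token attend over the whole document. The names `build_masks`, `masked_attention_weights`, and `sentence_ids` are hypothetical and chosen for illustration; they are not taken from the paper.

```python
import math


def build_masks(sentence_ids, causal=False):
    """Build illustrative long-term and short-term attention masks.

    sentence_ids[i] is the index of the sentence that token i belongs to
    (an assumed input format). mask[i][j] is True when position i may
    attend to position j. With causal=True, positions after i are hidden,
    as in decoder self-attention.
    """
    n = len(sentence_ids)
    # Assumed long-term mask: every visible token in the document.
    long_mask = [[(not causal) or j <= i for j in range(n)] for i in range(n)]
    # Assumed short-term mask: only visible tokens of the same sentence.
    short_mask = [
        [long_mask[i][j] and sentence_ids[i] == sentence_ids[j] for j in range(n)]
        for i in range(n)
    ]
    return long_mask, short_mask


def masked_attention_weights(scores, mask):
    """Row-wise softmax over raw scores; masked-out positions get weight 0."""
    weights = []
    for row_scores, row_mask in zip(scores, mask):
        allowed = [s for s, m in zip(row_scores, row_mask) if m]
        top = max(allowed)  # subtract the max for numerical stability
        exps = [math.exp(s - top) if m else 0.0
                for s, m in zip(row_scores, row_mask)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Usage: a 5-token document made of two sentences (3 tokens + 2 tokens).
sentence_ids = [0, 0, 0, 1, 1]
long_mask, short_mask = build_masks(sentence_ids)
uniform_scores = [[0.0] * 5 for _ in range(5)]
long_weights = masked_attention_weights(uniform_scores, long_mask)
short_weights = masked_attention_weights(uniform_scores, short_mask)
```

In a multi-head setting, one plausible design (again an assumption, not something the abstract confirms) is to apply the short-term mask to some heads and the long-term mask to the others, so that local and document-level information are captured side by side.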

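This study examines context-aware neural machine translation systems and finds that the auto-regressive property of the standard Transformer brings both the advantage of consistency and the disadvantage of error accumulation. It therefore proposes a simple long-short term masking self-attention mechanism to capture long-range dependencies and reduce error propagation. Experiments on two public datasets show that the approach achieves high BLEU scores and captures discourse phenomena.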