Position modeling plays a critical role in Transformers. In this paper, we
focus on length extrapolation, i.e., training on short texts while evaluating
on longer sequences. We define attention resolution as an indicator of
extrapolation. We then propose two designs to improve the attention resolution
of Transformers. Specifically, we introduce a relative position embedding to
explicitly maximize attention resolution. Moreover, we use blockwise causal
attention during inference for better resolution. We evaluate different
Transformer variants with language modeling. Experimental results show that our
model achieves strong performance in both interpolation and extrapolation
settings. The code will be made available.

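This paper examines position modeling in Transformers and how to improve their predictions on long texts. Introducing a relative position embedding and blockwise causal attention effectively improves the model's predictive performance.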

A Length-Extrapolatable Transformer
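
The abstract above mentions blockwise causal attention at inference time but does not spell out the masking rule. The following is a minimal sketch under a stated assumption: the sequence is split into fixed-size blocks, and each query attends causally to keys in its own block and in the immediately preceding block. The function name `blockwise_causal_mask` and the `block_size` parameter are hypothetical and chosen for illustration; the paper's actual configuration may differ.

```python
import numpy as np


def blockwise_causal_mask(seq_len: int, block_size: int) -> np.ndarray:
    """Return a boolean mask where entry [i, j] is True if query i may attend to key j.

    Assumption (not stated in the abstract): a query sees keys that are not in
    the future and that lie in its own block or the block directly before it.
    """
    if seq_len <= 0 or block_size <= 0:
        raise ValueError("seq_len and block_size must be positive")

    positions = np.arange(seq_len)
    block_ids = positions // block_size  # block index of every position

    # Standard causal constraint: key position j must not exceed query position i.
    causal = positions[None, :] <= positions[:, None]

    # Block constraint: the key's block is at most one block behind the query's block.
    block_gap = block_ids[:, None] - block_ids[None, :]
    nearby = block_gap <= 1

    return causal & nearby


if __name__ == "__main__":
    # Six positions in blocks of two: block ids are [0, 0, 1, 1, 2, 2].
    print(blockwise_causal_mask(6, 2).astype(int))
```

With this rule, the number of keys visible to any query is bounded by twice the block size, so setting the block size near the training length would keep relative distances at inference close to those seen in training. That motivation is an interpretation of the sketch, not a claim taken from the abstract.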

Non-autoregressive models are promising for various text generation tasks.
Previous work rarely models the positions of generated words explicitly.
However, position modeling is an essential problem in non-autoregressive
text generation. In this study, we propose PNAT, which incorporates positions
as a latent variable into the text generative process. Experimental results
show that PNAT achieves top results on machine translation and paraphrase
generation tasks, outperforming several strong baselines.

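This study proposes PNAT, which models positions as a latent variable in the non-autoregressive text generation process. Experimental results show that PNAT achieves the best results on machine translation and paraphrase generation tasks, outperforming several strong baselines.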