Dec 2022
Receptive Field Alignment Enables Transformer Length Extrapolation
Ta-Chung Chi, Ting-Han Fan, Alexander I. Rudnicky
TL;DR
This work studies relative positional embeddings for language models and proposes a self-attention mechanism built on an alignment hypothesis: receptive fields are aligned during training so that the properties of the relative positional embedding are preserved at test time. The proposed Sandwich positional embedding lets the model incorporate information from sequences longer than those seen during training, and its implicitly windowed self-attention enables efficient inference.
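To make the mechanism above concrete, here is a minimal NumPy sketch of a Sandwich-style additive attention bias, assuming the core idea of reusing inner products of sinusoidal positional embeddings as a relative bias; the `scale` hyperparameter and the exact way the bias enters the attention logits are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def sinusoidal_embeddings(num_positions: int, dim: int) -> np.ndarray:
    """Standard sinusoidal positional embeddings; dim is assumed even."""
    positions = np.arange(num_positions)[:, None]          # (L, 1)
    freqs = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))  # (dim/2,)
    angles = positions * freqs                             # (L, dim/2)
    emb = np.zeros((num_positions, dim))
    emb[:, 0::2] = np.sin(angles)
    emb[:, 1::2] = np.cos(angles)
    return emb

def sandwich_bias(num_positions: int, dim: int, scale: float = 1.0) -> np.ndarray:
    """Additive attention bias from inner products of sinusoidal embeddings.

    Because sin(i*f)*sin(j*f) + cos(i*f)*cos(j*f) = cos((i - j)*f), each
    entry depends only on the relative distance i - j and decays with
    distance, which produces the implicit attention window mentioned in
    the TL;DR. `scale` is a hypothetical tuning knob for illustration.
    """
    p = sinusoidal_embeddings(num_positions, dim)
    return scale * (p @ p.T)  # (L, L); entry (i, j) biases the logit q_i . k_j

# Hypothetical usage inside causal self-attention:
#   logits = (q @ k.T) / np.sqrt(head_dim) + sandwich_bias(L, dim)
```

Since the bias is parameter-free and a function of relative distance only, it can be evaluated at any test length, which is what allows extrapolation beyond the training sequence length.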
Abstract
Length extrapolation is a desirable property that permits training a transformer language model on short sequences and retaining similar perplexities when the model is tested on substantially longer sequences.