Transformers trained on shorter sequences are widely known to generalize poorly to longer ones at test time. This raises the question of whether Transformer models are genuine reasoning engines, despite their impressive abilities in mathematical problem solving and code synthesis. In this paper, we offer a vanishing variance perspective on this issue. To the best of our knowledge, we are the first to demonstrate that, even for today's frontier models, longer sequences reduce the variance of the outputs of the multi-head attention modules. On the argmax retrieval and dictionary lookup tasks, our experiments show that applying layer normalization after the attention outputs leads to significantly better length generalization. Our analyses attribute this improvement to a reduction, though not a complete elimination, of the distribution shift caused by vanishing variance.
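The vanishing variance effect can be illustrated with a toy simulation. The sketch below is not the paper's experimental setup: it assumes a single attention head, a single query, and i.i.d. standard Gaussian queries, keys, and values, and all function names are hypothetical. Under these assumptions the attention output is a weighted average of more and more independent value vectors as the sequence grows, so its variance shrinks, while a parameter-free layer normalization applied afterwards rescales it to roughly unit variance.

```python
import math
import random
import statistics


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_output(seq_len, dim, rng):
    """Single-query, single-head attention over random keys and values.

    Assumption: queries, keys, and values are i.i.d. N(0, 1); this is a toy
    setting, not a trained model.
    """
    query = [rng.gauss(0, 1) for _ in range(dim)]
    keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
    values = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys
    ]
    weights = softmax(scores)
    # Output is the attention-weighted average of the value vectors.
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def layer_norm(x, eps=1e-5):
    """Layer normalization without learned scale or bias."""
    mean = statistics.fmean(x)
    var = statistics.pvariance(x, mu=mean)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]


def mean_feature_variance(seq_len, dim=16, trials=50, normalize=False, seed=0):
    """Average, over random trials, of the variance across output features."""
    rng = random.Random(seed)
    variances = []
    for _ in range(trials):
        out = attention_output(seq_len, dim, rng)
        if normalize:
            out = layer_norm(out)
        variances.append(statistics.pvariance(out))
    return statistics.fmean(variances)


if __name__ == "__main__":
    for n in (4, 32, 256):
        raw = mean_feature_variance(n)
        normed = mean_feature_variance(n, normalize=True)
        print(f"seq_len={n:4d}  raw variance={raw:.4f}  after layer norm={normed:.4f}")
```

In this toy setting the raw variance falls as the sequence length grows, whereas the normalized variance stays close to one. This only illustrates the scale of the output; it does not show that normalization removes every aspect of the distribution shift.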

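This work addresses the problem that Transformer models trained on short sequences generalize poorly to long sequences. We are the first to show, from a vanishing variance perspective, that longer sequence lengths reduce the variance of the outputs of the multi-head attention modules. Experiments show that applying layer normalization after the attention outputs significantly improves length generalization, indicating that this change helps reduce the distribution shift caused by vanishing variance.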

On Vanishing Variance in Transformer Length Generalization