Since its introduction, the transformer model has demonstrated outstanding performance across various tasks. However, there are still unresolved issues regarding length generalization, particularly in algorithmic tasks. In this paper, we investigate the inherent capabilities of transformer models in learning arithmetic algorithms, such as addition and multiplication. Through experiments and attention analysis, we identify a number of crucial factors for achieving optimal length generalization. We show that transformer models are able to generalize to long lengths with the help of targeted attention biasing. We then introduce Attention Bias Calibration (ABC), a calibration stage that enables the model to automatically learn the proper attention biases, which we link to mechanisms in relative position encoding. We demonstrate that using ABC, the transformer model can achieve unprecedented perfect length generalization on certain arithmetic tasks.
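To illustrate what an attention bias tied to relative position looks like, the following minimal sketch adds a bias matrix to the attention scores before the softmax. It is not the ABC procedure itself, which the abstract does not specify. The function names, the `bias_by_offset` table, and the choice of a large negative default for masked offsets are all assumptions made for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def relative_bias(n_queries, n_keys, bias_by_offset, default=-1e9):
    """Build an additive bias B[i][j] that depends only on the offset j - i.

    bias_by_offset is a hypothetical lookup table {offset: bias}. Offsets
    that are not listed get a large negative bias, which effectively masks
    them out after the softmax.
    """
    return [
        [bias_by_offset.get(j - i, default) for j in range(n_keys)]
        for i in range(n_queries)
    ]

def biased_attention_weights(scores, bias):
    """Compute softmax(scores + bias) row by row."""
    return [
        softmax([s + b for s, b in zip(score_row, bias_row)])
        for score_row, bias_row in zip(scores, bias)
    ]

# Assumed example: each query may attend only to its own position and the
# position immediately before it (offsets 0 and -1).
offsets = {0: 0.0, -1: 0.0}

# With uninformative (all-zero) scores, the bias alone shapes the pattern.
short = biased_attention_weights(
    [[0.0] * 4 for _ in range(4)], relative_bias(4, 4, offsets)
)
# The same offset table applies unchanged to a longer sequence.
longer = biased_attention_weights(
    [[0.0] * 8 for _ in range(8)], relative_bias(8, 8, offsets)
)
```

Because the bias is indexed by offset rather than by absolute position, the same table can be reused at any sequence length, which is the property that makes this kind of bias relevant to length generalization.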

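Through experiments and attention analysis, we study the inherent ability of transformer models to learn arithmetic algorithms such as addition and multiplication, and identify several key factors for achieving optimal length generalization. We show that transformer models can generalize to longer lengths with the help of targeted attention biasing, and we introduce an Attention Bias Calibration (ABC) stage that lets the model automatically learn the proper attention biases, which we link to mechanisms in relative position encoding. We demonstrate that with ABC, the transformer model can achieve unprecedented perfect length generalization on certain arithmetic tasks.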

From Interpolation to Extrapolation: Complete Length Generalization for Arithmetic Transformers