Since its introduction, the transformer model has demonstrated outstanding
performance across various tasks. However, there are still unresolved issues
regarding length generalization, particularly in algorithmic tasks. In this
paper, we investigate the inherent capabilities of transformer models in
learning arithmetic algorithms, such as addition and multiplication. Through
experiments and attention analysis, we identify a number of crucial factors for
achieving optimal length generalization. We show that transformer models are
able to generalize to long lengths with the help of targeted attention biasing.
We then introduce Attention Bias Calibration (ABC), a calibration stage that
enables the model to automatically learn the proper attention biases, which we
link to mechanisms in relative position encoding. We demonstrate that using
ABC, the transformer model can achieve unprecedented perfect length
generalization on certain arithmetic tasks.
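
As a rough illustration of what "attention biasing" refers to above, the sketch below adds a fixed bias matrix to pre-softmax attention scores. The bias shape (a penalty that depends only on the relative offset i - j), the `slope` parameter, and all function names are assumptions made for illustration; this is not the ABC calibration procedure itself, which the abstract does not specify.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def relative_bias(n, slope=1.0):
    # Hypothetical bias: depends only on the offset i - j, so the same
    # rule can be applied to any sequence length n.
    return [[-slope * abs(i - j) for j in range(n)] for i in range(n)]

def biased_attention(scores, bias):
    # Add the bias to the raw scores, then normalize each row.
    return [
        softmax([s + b for s, b in zip(score_row, bias_row)])
        for score_row, bias_row in zip(scores, bias)
    ]

# With uniform raw scores, the bias alone decides where attention goes.
scores = [[0.0] * 4 for _ in range(4)]
weights = biased_attention(scores, relative_bias(4, slope=2.0))
```

Because the bias in this sketch is a function of relative position only, the same rule applies unchanged to sequences longer than those seen in training, which is the property that ties this kind of biasing to relative position encoding.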

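Through experiments and attention analysis, we study the inherent ability of transformer models to learn arithmetic algorithms such as addition and multiplication, and identify several key factors for achieving optimal length generalization. We show that transformer models can generalize to longer sequences with the help of targeted attention biasing, and we introduce Attention Bias Calibration (ABC), a calibration stage that lets the model automatically learn the proper attention biases, which we link to mechanisms in relative position encoding. We demonstrate that with ABC, the transformer model can achieve unprecedented perfect length generalization on certain arithmetic tasks.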

From Interpolation to Extrapolation: Complete Length Generalization for Arithmetic Transformers

Investigating deep learning language models has always been a significant
research area due to the "black box" nature of most advanced models. With the
recent advancements in pre-trained language models based on transformers and
their increasing integration into daily life, addressing this issue has become
more pressing. In order to achieve an explainable AI model, it is essential to
comprehend the procedural steps involved and compare them with human thought
processes. Thus, in this paper, we use simple, well-understood non-language
tasks to explore these models' inner workings. Specifically, we apply a
pre-trained language model to constrained arithmetic problems with hierarchical
structure and analyze its attention weight scores and hidden states. The
investigation reveals promising results, with the model addressing hierarchical
problems in a moderately structured manner, similar to human problem-solving
strategies. Additionally, by inspecting the attention weights layer by layer,
we uncover an unconventional finding: layer 10, rather than the model's
final layer, is the optimal layer to unfreeze for the least parameter-intensive
approach to fine-tuning the model. We support these findings with entropy
analysis and token embeddings similarity analysis. The attention analysis
allows us to hypothesize that the model can generalize to longer sequences in
the ListOps dataset, a conclusion later confirmed through testing on sequences
longer than those in the training set. Lastly, using a straightforward
task in which the model predicts the winner of a Tic-Tac-Toe game, we identify
limitations in attention analysis, particularly its inability to capture 2D
patterns.
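
The entropy analysis mentioned above can be illustrated with a minimal sketch: the Shannon entropy of one attention row measures how spread out a token's attention is. This assumes each row is a probability distribution (non-negative, summing to 1); the function name is hypothetical, and the abstract does not say exactly how the authors compute or aggregate entropy.

```python
import math

def attention_entropy(weights):
    # Shannon entropy (in nats) of one attention distribution.
    # Zero-probability entries contribute nothing and are skipped.
    return -sum(p * math.log(p) for p in weights if p > 0.0)

uniform = [0.25, 0.25, 0.25, 0.25]  # attention spread evenly over 4 tokens
peaked = [1.0, 0.0, 0.0, 0.0]       # attention focused on a single token

uniform_entropy = attention_entropy(uniform)  # equals log(4), the maximum for 4 tokens
peaked_entropy = attention_entropy(peaked)    # equals 0, the minimum
```

Low entropy indicates a head that focuses on a few positions, while high entropy indicates diffuse attention; comparing these values layer by layer is one plausible way to support a layer-wise analysis.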

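This paper uses constrained arithmetic problems to analyze attention weight scores and hidden states in a pre-trained language model. We find that the model addresses hierarchical problems in a moderately structured manner, similar to human problem-solving strategies, and infer that it can generalize to sequences longer than those in the training set. Attention analysis shows that layer 10, rather than the model's final layer, is the optimal layer to unfreeze for fine-tuning. We also find that attention analysis has limitations, in particular its inability to capture 2D patterns.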

Opening the Black Box: Analyzing Attention Weights and Hidden States in Pre-trained Language Models for Non-language Tasks

Recently, many pre-trained language models for source code have been proposed
to model the context of code and serve as a basis for downstream code
intelligence tasks such as code completion, code search, and code
summarization. These models leverage masked pre-training and the Transformer
architecture and have achieved promising results. However, there is still little
progress regarding interpretability of existing pre-trained code models. It is
not clear why these models work and what feature correlations they can capture.
In this paper, we conduct a thorough structural analysis aiming to provide an
interpretation of pre-trained language models for source code (e.g., CodeBERT
and GraphCodeBERT) from three distinctive perspectives: (1) attention analysis,
(2) probing on the word embedding, and (3) syntax tree induction. Through
comprehensive analysis, this paper reveals several insightful findings that may
inspire future studies: (1) Attention aligns strongly with the syntax structure
of code. (2) Pre-trained language models of code can preserve the syntax
structure of code in the intermediate representations of each Transformer
layer. (3) The pre-trained models of code are able to induce syntax
trees of code. These findings suggest that it may be helpful to incorporate
the syntax structure of code into the process of pre-training for better code
representations.

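This paper analyzes the structural properties that pre-trained language models, in particular CodeBERT and GraphCodeBERT, capture for source code. Through a comprehensive analysis covering attention analysis, probing of word embeddings, and syntax tree induction, it reveals several insightful findings that may inspire future studies.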