Viewing Transformers as interacting particle systems, we describe the
geometry of learned representations when the weights are not time dependent. We
show that particles, representing tokens, tend to cluster toward particular
limiting objects as time tends to infinity. The type of limiting object that
emerges depends on the spectrum of the value matrix. Additionally, in the
one-dimensional case we prove that the self-attention matrix converges to a
low-rank Boolean matrix. The combination of these results mathematically
confirms the empirical observation made by Vaswani et al.
\cite{vaswani2017attention} that \emph{leaders} appear in a sequence of tokens
when processed by Transformers.

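This paper views Transformers as interacting particle systems and describes the geometry of learned representations when the weights do not vary with time. It proves that the particles, which represent tokens, cluster toward particular limiting objects as time tends to infinity, and that the type of limiting object depends on the spectrum of the value matrix. In the one-dimensional case, it also proves that the self-attention matrix converges to a low-rank Boolean matrix. Together, these results mathematically confirm the empirical observation of Vaswani et al. that "leaders" emerge when Transformers process a sequence of tokens.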

The emergence of clusters in self-attention dynamics
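
As a rough illustration of the clustering described in the abstract above, the sketch below integrates a one-dimensional version of self-attention dynamics, $\dot{x}_i = \sum_j P_{ij} V x_j$ with $P_{ij} = \mathrm{softmax}_j(x_i \, QK \, x_j)$. This continuous-time form, the scalar choices $QK = V = 1$, the four initial particles, the forward-Euler scheme, and all function names are assumptions made for illustration; none of them are taken from the paper.

```python
import numpy as np

def attention_matrix(x, qk=1.0):
    """Row-stochastic self-attention matrix for scalar tokens x (1-D array)."""
    logits = qk * np.outer(x, x)
    logits -= logits.max(axis=1, keepdims=True)  # stabilise the softmax
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

def simulate(x0, v=1.0, qk=1.0, dt=0.01, steps=800):
    """Forward-Euler integration of dx_i/dt = sum_j P_ij(x) * v * x_j."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(steps):
        x = x + dt * v * (attention_matrix(x, qk) @ x)
    return x

# Assumed initial tokens: two negative and two positive particles.
x0 = [-2.0, -1.0, 0.5, 1.5]
x_final = simulate(x0)
P_final = attention_matrix(x_final)
print(np.round(P_final, 3))
```

With this initial condition, the final attention matrix should be numerically close to a Boolean matrix of rank two: each row puts essentially all of its mass on either the smallest or the largest particle, which play the role of the leaders.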

Length extrapolation is a desirable property that permits training a
transformer language model on short sequences and retaining similar
perplexities when the model is tested on substantially longer sequences.
ALiBi, a relative positional embedding mechanism applied to the transformer
self-attention matrix, is the most widely used method to date that demonstrates
the length extrapolation property. In this paper, we show that ALiBi surprisingly
does not utilize tokens further than the training sequence length, which can be
explained by its implicit windowed attention effect that aligns the receptive
field during training and testing stages. Inspired by ALiBi and the receptive
field alignment hypothesis, we propose another transformer positional embedding
design named~\textbf{Sandwich} that uses information from beyond the training
sequence length; it is a greatly simplified formulation of the earliest
proposed Sinusoidal positional embedding. Finally, we show that both ALiBi and
Sandwich enable efficient inference thanks to their implicit windowed attention
effect.

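This paper studies relative positional embeddings for Transformer language models. It shows that ALiBi's length extrapolation can be explained by an implicit windowed attention effect that aligns the receptive field between training and testing. Building on this receptive field alignment hypothesis, it proposes the Sandwich positional embedding, which incorporates information from beyond the training sequence length and, thanks to the same implicit windowed attention effect, enables efficient inference.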

Receptive Field Alignment Enables Transformer Length Extrapolation
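
To make the implicit windowed attention effect mentioned in the abstract above more concrete, here is a minimal sketch. It assumes the usual ALiBi form of a linear distance penalty added to causal attention logits, with an illustrative slope of 0.5 and all content logits set to zero. It also checks the trigonometric identity $\sin a \sin b + \cos a \cos b = \cos(a - b)$, which makes inner products of sinusoidal embeddings depend only on the relative distance $i - j$. The exact Sandwich formulation is not given in the abstract and is not reproduced here; the function names, the slope, and the sizes are hypothetical.

```python
import numpy as np

def alibi_bias(seq_len, slope):
    """Causal ALiBi-style bias: -slope * (i - j) for j <= i, -inf for j > i."""
    pos = np.arange(seq_len)
    dist = pos[:, None] - pos[None, :]
    return np.where(dist >= 0, -slope * dist, -np.inf)

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

def sinusoidal_embeddings(seq_len, dim, base=10000.0):
    """Sinusoidal embeddings laid out as [sin block, cos block]; dim must be even."""
    pos = np.arange(seq_len)[:, None]
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    angles = pos * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

# With zero content logits, the bias alone decides where attention mass goes.
attn = softmax(alibi_bias(512, slope=0.5))
mass_in_window = attn[-1, -32:].sum()  # last query, 32 most recent tokens
print(f"attention mass within the last 32 tokens: {mass_in_window:.6f}")

# Inner products of sinusoidal embeddings depend only on i - j (Toeplitz matrix).
emb = sinusoidal_embeddings(64, 32)
gram = emb @ emb.T
print("Toeplitz:", np.allclose(gram[:-1, :-1], gram[1:, 1:]))
```

In this toy setting nearly all attention mass of the last query falls on the most recent tokens, however long the sequence is, which is one way to picture a receptive field that stays fixed between training and testing.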

Transformer networks are able to capture patterns in data coming from many
domains (text, images, videos, proteins, etc.) with little or no change to
architecture components. We perform a theoretical analysis of the core
component responsible for signal propagation between elements, i.e. the
self-attention matrix. In practice, this matrix typically exhibits two
properties: (1) it is sparse, meaning that each token only attends to a small
subset of other tokens; and (2) it changes dynamically depending on the input
to the module. With these considerations in mind, we ask the following
questions: Can a fixed self-attention module approximate arbitrary sparse
patterns depending on the input? How small is the hidden size $d$ required for
such approximation? We make progress in answering these questions and show that
the self-attention matrix can provably approximate sparse matrices, where
sparsity is in terms of a bounded number of nonzero elements in each row and
column. While the parameters of self-attention are fixed, various sparse
matrices can be approximated by only modifying the inputs. Our proof is based
on the random projection technique and uses the seminal Johnson-Lindenstrauss
lemma. Our proof is constructive, enabling us to propose an algorithm for
finding adaptive inputs and fixed self-attention parameters in order to
approximate a given matrix. In particular, we show that, in order to
approximate any sparse matrix up to a given precision defined in terms of
preserving matrix element ratios, $d$ grows only logarithmically with the
sequence length $L$ (i.e. $d = O(\log L)$).

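This paper studies the self-attention matrix in Transformer networks, focusing on the approximation of sparse patterns. It proves that, with the self-attention parameters held fixed, various sparse matrices can be approximated by varying only the inputs, and it gives a constructive proof based on the random projection technique, together with an algorithm. In particular, to approximate any sparse matrix up to a precision defined in terms of preserving matrix element ratios, a hidden size of $d = O(\log L)$ suffices, where $L$ is the sequence length.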
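
The random projection idea can be pictured with a small experiment. The sketch below is not the paper's algorithm: it is a simplified construction, assumed here for illustration, in which every position gets a random unit-norm key in $\mathbb{R}^d$ with $d < L$, and each query is a scaled sum of the keys it should attend to. Because random directions are nearly orthogonal, the softmax then concentrates on the intended positions. The target pattern (two nonzeros per row and per column), the sizes, the scale `beta`, and the variable names are all hypothetical, and the sketch only recovers the support of the pattern rather than the ratio-preserving precision guarantee stated in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d, beta = 1024, 256, 30.0  # sequence length, hidden size, logit scale (assumed)

# Target sparse pattern: row i attends to positions (i + 1) % L and (i + 7) % L,
# so every row and every column has exactly two nonzero entries.
rows = np.arange(L)
support = np.stack([(rows + 1) % L, (rows + 7) % L], axis=1)  # shape (L, 2)

# Random unit-norm keys: nearly orthogonal to each other even though d < L.
keys = rng.standard_normal((L, d))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)

# Each query is the scaled sum of the keys it should attend to.
queries = beta * keys[support].sum(axis=1)  # shape (L, d)

# Softmax attention over all positions.
logits = queries @ keys.T
logits -= logits.max(axis=1, keepdims=True)
attn = np.exp(logits)
attn /= attn.sum(axis=1, keepdims=True)

# Fraction of each row's attention mass that lands on the intended positions.
mass_on_support = np.take_along_axis(attn, support, axis=1).sum(axis=1)
print(f"minimum mass on target support: {mass_on_support.min():.6f}")
```

Keeping `d` fixed while increasing `L` is a simple way to probe how far this naive construction can be pushed before unintended positions start to receive noticeable attention.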