Decomposing model activations into interpretable components is a key open
problem in mechanistic interpretability. Sparse autoencoders (SAEs) are a
popular method for decomposing the internal activations of trained transformers
into sparse, interpretable features, and have been applied to MLP layers and
the residual stream. In this work we train SAEs on attention layer outputs and
show that here, too, SAEs find a sparse, interpretable decomposition. We
demonstrate this on transformers from several model families and up to 2B
parameters.
We perform a qualitative study of the features computed by attention layers,
and find multiple families: long-range context, short-range context and
induction features. We qualitatively study the role of every head in GPT-2
Small, and estimate that at least 90% of the heads are polysemantic, i.e. have
multiple unrelated roles.
Further, we show that sparse autoencoders are a useful tool that enables
researchers to explain model behavior in greater detail than prior work. For
example, we explore the mystery of why models have so many seemingly redundant
induction heads, use SAEs to motivate the hypothesis that some are long-prefix
whereas others are short-prefix, and confirm this with more rigorous analysis.
We use our SAEs to analyze the computation performed by the Indirect Object
Identification circuit (Wang et al.), validating that the SAEs find causally
meaningful intermediate variables, and deepening our understanding of the
semantics of the circuit. We open-source the trained SAEs and a tool for
exploring arbitrary prompts through the lens of Attention Output SAEs.
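
As a rough illustration of what a sparse autoencoder computes, the sketch below runs the forward pass and loss of one common SAE formulation (ReLU encoder, linear decoder, L1 sparsity penalty) on random stand-in activations. The names `sae_forward`, `sae_loss`, `l1_coeff` and all sizes are assumptions made for this example; the abstract does not specify the authors' architecture or training setup, and the weights here are random rather than trained.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Return (features, reconstruction) for a batch of activations x.

    x has shape (batch, d_model); features has shape (batch, d_sae).
    """
    # ReLU encoder: non-negative feature activations (sparse once trained).
    features = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)
    # Linear decoder: reconstruct the activation from the features.
    recon = features @ W_dec + b_dec
    return features, recon

def sae_loss(x, features, recon, l1_coeff):
    """Reconstruction error plus an L1 penalty that encourages sparsity."""
    mse = np.mean(np.sum((x - recon) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(features), axis=-1))
    return mse + l1_coeff * sparsity

# Hypothetical sizes; x stands in for attention layer outputs.
rng = np.random.default_rng(0)
d_model, d_sae, batch = 16, 64, 8
x = rng.normal(size=(batch, d_model))
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_enc = np.zeros(d_sae)
b_dec = np.zeros(d_model)

features, recon = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, features, recon, l1_coeff=1e-3)
```

In this formulation, each row of `W_dec` is the direction a feature writes into activation space, which is what makes individual features inspectable.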

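Sparse autoencoders are applied to interpret the internal activations of trained Transformer models. They are found to yield a sparse, interpretable decomposition, which helps researchers explain model behavior in greater detail and deepens the understanding of circuit semantics.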

Interpreting Attention Layer Outputs with Sparse Autoencoders

Transformers are widely used to extract complex semantic meanings from input
tokens, yet they usually operate as black-box models. In this paper, we present
a simple yet informative decomposition of hidden states (or embeddings) of
trained transformers into interpretable components. For any layer, embedding
vectors of input sequence samples are represented by a tensor $\boldsymbol{h}
\in \mathbb{R}^{C \times T \times d}$. Given embedding vector
$\boldsymbol{h}_{c,t} \in \mathbb{R}^d$ at sequence position $t \le T$ in a
sequence (or context) $c \le C$, extracting the mean effects yields the
decomposition \[ \boldsymbol{h}_{c,t} = \boldsymbol{\mu} + \mathbf{pos}_t +
\mathbf{ctx}_c + \mathbf{resid}_{c,t} \] where $\boldsymbol{\mu}$ is the global
mean vector, $\mathbf{pos}_t$ and $\mathbf{ctx}_c$ are the mean vectors across
contexts and across positions respectively, and $\mathbf{resid}_{c,t}$ is the
residual vector. For popular transformer architectures and diverse text
datasets, empirically we find pervasive mathematical structure: (1)
$(\mathbf{pos}_t)_{t}$ forms a low-dimensional, continuous, and often spiral
shape across layers, (2) $(\mathbf{ctx}_c)_c$ shows clear cluster structure
that falls into context topics, and (3) $(\mathbf{pos}_t)_{t}$ and
$(\mathbf{ctx}_c)_c$ are mutually incoherent -- namely $\mathbf{pos}_t$ is
almost orthogonal to $\mathbf{ctx}_c$ -- which is canonical in compressed
sensing and dictionary learning. This decomposition offers structural insights
about input formats in in-context learning (especially for induction heads) and
in arithmetic tasks.
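
The decomposition above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: it assumes that $\mathbf{pos}_t$ and $\mathbf{ctx}_c$ are centered by subtracting the global mean (which the identity requires), and the function name `decompose` and the tensor sizes are hypothetical.

```python
import numpy as np

def decompose(h):
    """Split h of shape (C, T, d) into mean effects and a residual.

    Returns mu (d,), pos (T, d), ctx (C, d), resid (C, T, d) such that
    h[c, t] = mu + pos[t] + ctx[c] + resid[c, t].
    """
    mu = h.mean(axis=(0, 1))       # global mean vector
    pos = h.mean(axis=0) - mu      # mean across contexts, per position
    ctx = h.mean(axis=1) - mu      # mean across positions, per context
    resid = h - mu - pos[None, :, :] - ctx[:, None, :]
    return mu, pos, ctx, resid

# Random stand-in for hidden states: C contexts, T positions, d dimensions.
rng = np.random.default_rng(0)
C, T, d = 5, 7, 4
h = rng.normal(size=(C, T, d))

mu, pos, ctx, resid = decompose(h)
recon = mu + pos[None, :, :] + ctx[:, None, :] + resid
```

With this centering, the positional and context effects each average to zero, and the residual averages to zero along both the context and the position axis.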

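By decomposing the hidden states (or embeddings) of trained Transformers into interpretable components, this paper presents a simple yet informative method that reveals structural insights about input formats in in-context learning and arithmetic tasks.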

Uncovering hidden geometry in Transformers via disentangling position and context

Despite its importance, choosing the structural form of the kernel in
nonparametric regression remains a black art. We define a space of kernel
structures which are built compositionally by adding and multiplying a small
number of base kernels. We present a method for searching over this space of
structures which mirrors the scientific discovery process. The learned
structures can often decompose functions into interpretable components and
enable long-range extrapolation on time-series datasets. Our structure search
method outperforms many widely used kernels and kernel combination methods on a
variety of prediction tasks.
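
To make the compositional idea concrete, the sketch below builds kernels by adding and multiplying base kernels. The base kernels chosen here (squared-exponential, periodic, linear) and the helper names `add` and `mul` are assumptions for illustration; the abstract does not list the base kernels or describe the search procedure, so only the composition step is shown.

```python
import math

def rbf(lengthscale):
    """Squared-exponential kernel: smooth local variation."""
    return lambda x, y: math.exp(-((x - y) ** 2) / (2 * lengthscale ** 2))

def periodic(period, lengthscale):
    """Periodic kernel: exactly repeating structure."""
    return lambda x, y: math.exp(
        -2 * math.sin(math.pi * abs(x - y) / period) ** 2 / lengthscale ** 2
    )

def linear(offset=0.0):
    """Linear kernel: linear trends."""
    return lambda x, y: (x - offset) * (y - offset)

def add(k1, k2):
    """Sum of two kernels (itself a valid kernel)."""
    return lambda x, y: k1(x, y) + k2(x, y)

def mul(k1, k2):
    """Product of two kernels (itself a valid kernel)."""
    return lambda x, y: k1(x, y) * k2(x, y)

# Locally periodic structure: periodicity whose shape drifts over time.
locally_periodic = mul(rbf(5.0), periodic(1.0, 1.0))
# Linear trend plus a periodic component.
trend_plus_season = add(linear(), periodic(1.0, 1.0))

value = locally_periodic(0.3, 1.2)
```

Because sums and products of valid kernels are again valid kernels, each composite corresponds to a readable structural hypothesis about the data, such as "trend plus seasonality".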

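This paper proposes a method for finding the best kernel by composing base kernels, so that the fitted function decomposes into easily understood parts, enabling long-range extrapolation on time-series data and strong performance on a variety of prediction tasks.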