The rapid progress of research aimed at interpreting the inner workings of
advanced language models has highlighted a need for contextualizing the
insights gained from years of work in this area. This primer provides a concise
technical introduction to the current techniques used to interpret the inner
workings of Transformer-based language models, focusing on the generative
decoder-only architecture. We conclude by presenting a comprehensive overview
of the known internal mechanisms implemented by these models, uncovering
connections across popular approaches and active research directions in this
area.

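This paper gives a concise technical introduction to current techniques for interpreting the inner workings of Transformer-based language models, focusing on the generative decoder-only architecture. It closes with a comprehensive overview of the known internal mechanisms these models implement, revealing connections across popular approaches and active research directions in the field.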

A Primer on the Inner Workings of Transformer-based Language Models

Transformers demonstrate impressive performance on a range of reasoning
benchmarks. To evaluate the degree to which these abilities are a result of
actual reasoning, existing work has focused on developing sophisticated
benchmarks for behavioral studies. However, these studies do not provide
insights into the internal mechanisms driving the observed capabilities. To
improve our understanding of the internal mechanisms of transformers, we
present a comprehensive mechanistic analysis of a transformer trained on a
synthetic reasoning task. We identify a set of interpretable mechanisms the
model uses to solve the task, and validate our findings using correlational and
causal evidence. Our results suggest that it implements a depth-bounded
recurrent mechanism that operates in parallel and stores intermediate results
in selected token positions. We anticipate that the motifs we identified in our
synthetic setting can provide valuable insights into the broader operating
principles of transformers and thus provide a basis for understanding more
complex models.

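Through a comprehensive mechanistic analysis of a transformer trained on a synthetic reasoning task, we identify a set of interpretable mechanisms that the model uses to solve the task, and validate our findings using correlational and causal evidence. Our results suggest that the model implements a depth-bounded recurrent mechanism that operates in parallel and stores intermediate results in selected token positions. We expect that the motifs identified in this synthetic setting can offer valuable insights into the broader operating principles of transformers.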

A Mechanistic Analysis of a Transformer Trained on a Symbolic Multi-Step Reasoning Task

Large Language Models (LLMs) have emerged as dominant foundational models in
modern NLP. However, their prediction process and internal mechanisms, such as
feed-forward networks and multi-head self-attention, remain largely
unexplored. In this study, we probe LLMs from a human
behavioral perspective, correlating values from LLMs with eye-tracking
measures, which are widely recognized as meaningful indicators of reading
patterns. Our findings reveal that LLMs exhibit a prediction pattern distinct
from that of RNN-based LMs. Moreover, moving up through the FFN layers, the
capacity for memorization and linguistic knowledge encoding rises until it
peaks, after which the layers shift toward comprehension. The
functions of self-attention are distributed across multiple heads. Lastly, we
scrutinize the gate mechanisms, finding that they control the flow of
information, with some gates promoting information and others eliminating it.

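From a human behavioral perspective, we probe the prediction process and internal mechanisms of large language models (LLMs) by correlating values from LLMs with eye-tracking measures, and find that LLMs exhibit a prediction pattern distinct from that of RNN-based language models. In addition, moving up through the FFN layers, the capacity for memorization and linguistic knowledge encoding rises until it peaks, after which the focus shifts to comprehension. The functions of self-attention are distributed across multiple heads. Finally, we examine the gate mechanisms and find that they control the flow of information: some gates promote information, while others eliminate it.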
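
As a rough illustration of what correlating model values with eye-tracking measures can look like, the sketch below computes a Pearson correlation between two per-word sequences. The abstract does not say which statistic or which model values the authors use, so the choice of Pearson correlation, the function name `pearson`, and all numbers are assumptions made purely for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy, made-up numbers for illustration only (NOT data from the study).
# Hypothetical per-word value read out of a model, e.g. a surprisal score.
model_values = [2.1, 7.4, 3.3, 9.0, 1.2]
# Hypothetical per-word eye-tracking measure, e.g. a gaze duration in ms.
gaze_durations_ms = [180.0, 310.0, 210.0, 350.0, 160.0]

r = pearson(model_values, gaze_durations_ms)
print(f"Pearson r = {r:.3f}")
```

A rank-based statistic such as Spearman correlation would be a reasonable alternative when the relation between the two measures is monotonic but not linear.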