How do transformer-based large language models (LLMs) store and retrieve
knowledge? We focus on the most basic form of this task -- factual recall,
where the model is tasked with explicitly surfacing stored facts in prompts of
the form `Fact: The Colosseum is in the country of'. We find that the mechanistic
story behind factual recall is more complex than previously thought. It
comprises several distinct, independent, and qualitatively different mechanisms
that additively combine, constructively interfering on the correct attribute.
We term this generic phenomenon the additive motif: models compute through
summing up multiple independent contributions. Each mechanism's contribution
may be insufficient alone, but summing results in constructive interference on the
correct answer. In addition, we extend the method of direct logit attribution
to attribute an attention head's output to individual source tokens. We use
this technique to unpack what we call `mixed heads', whose output is itself the
sum of two separate additive updates from different source tokens.
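
The per-source-token attribution described above can be illustrated with a small sketch. It relies on the fact that an attention head's output at a destination position is a weighted sum over source positions, so its projection onto a token's unembedding direction splits into one term per source token. This is a toy illustration, not the authors' code: the function `per_source_dla`, the combined `w_ov` matrix, the token labels, and all numbers are hypothetical, and the final LayerNorm is assumed to be frozen or folded into the unembedding so that the map to logits is linear.

```python
# Minimal sketch: split one attention head's direct logit attribution (DLA)
# by source token. All values below are made up for illustration.

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def matvec(vec, mat):
    """Row vector (length d_in) times matrix (d_in x d_out)."""
    return [sum(v * row[j] for v, row in zip(vec, mat))
            for j in range(len(mat[0]))]

def per_source_dla(attn_row, resid, w_ov, unembed_dir):
    """Logit contribution of each source position at one destination position.

    attn_row:    attention weights from the destination to each source position
    resid:       head input (residual-stream vector) at each source position
    w_ov:        the head's combined value-output matrix (d_model x d_model)
    unembed_dir: unembedding direction of the token of interest (d_model)
    """
    return [a * dot(matvec(x, w_ov), unembed_dir)
            for a, x in zip(attn_row, resid)]

# Hypothetical toy inputs: three source tokens, d_model = 2.
labels = ["Fact", "Colosseum", "country"]
attn_row = [0.2, 0.5, 0.3]
resid = [[1.0, 1.0], [1.0, 0.0], [0.0, 2.0]]
w_ov = [[2.0, 0.0], [0.0, 1.0]]
unembed_dir = [1.0, 1.0]

contribs = per_source_dla(attn_row, resid, w_ov, unembed_dir)

# Standard (unsplit) DLA: project the head's full output onto the same direction.
head_out = [sum(a * matvec(x, w_ov)[j] for a, x in zip(attn_row, resid))
            for j in range(len(unembed_dir))]
total_dla = dot(head_out, unembed_dir)

for label, c in zip(labels, contribs):
    print(f"{label:>10}: {c:+.2f}")
print(f"{'total':>10}: {total_dla:+.2f}")
```

Because every step is linear, the per-source terms sum to the head's total direct logit contribution. That is what lets a single head's effect be read as several separate additive updates, one per source token.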

Summing Up the Facts: Additive Mechanisms Behind Factual Recall in LLMs

We provide concrete evidence for memory management in a 4-layer transformer.
Specifically, we identify clean-up behavior, in which model components
consistently remove the output of preceding components during a forward pass.
Our findings suggest that the interpretability technique Direct Logit
Attribution provides misleading results. We show explicit examples where this
technique is inaccurate, as it does not account for clean-up behavior.
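
A toy sketch of why clean-up can mislead Direct Logit Attribution follows. It is a hypothetical two-component example, not the 4-layer model studied here: `cleanup_component`, `final_logit`, and every vector are assumptions chosen for illustration. An early component writes along one direction of the residual stream, and a later component erases whatever sits along that direction. Projecting the early output straight onto the unembedding direction then reports a large contribution, even though ablating the early component leaves the final logit unchanged.

```python
# Minimal sketch: clean-up behavior versus direct logit attribution (DLA).
# Hypothetical two-component toy; all vectors are made up for illustration.

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cleanup_component(resid, direction):
    """Hypothetical later component: writes the negative of the residual's
    projection onto `direction`, erasing what was written along it."""
    scale = dot(resid, direction) / dot(direction, direction)
    return [-scale * d for d in direction]

def final_logit(embed, early_out, write_dir, unembed_dir):
    """Run the toy forward pass and return the logit of the token of interest."""
    resid = [e + o for e, o in zip(embed, early_out)]   # after early component
    late_out = cleanup_component(resid, write_dir)      # clean-up step
    resid = [r + l for r, l in zip(resid, late_out)]    # after later component
    return dot(resid, unembed_dir)

embed = [0.0, 1.0]        # residual stream before the early component
write_dir = [1.0, 0.0]    # direction the early component writes along
early_out = [3.0, 0.0]    # early component's output
unembed_dir = [1.0, 0.5]  # unembedding direction of the token of interest

# DLA: project the early component's output directly onto the unembedding.
dla_early = dot(early_out, unembed_dir)

# Causal effect: compare the final logit with and without the early output.
logit_with = final_logit(embed, early_out, write_dir, unembed_dir)
logit_without = final_logit(embed, [0.0, 0.0], write_dir, unembed_dir)
true_effect = logit_with - logit_without

print(f"DLA of early component:      {dla_early:+.2f}")
print(f"Effect of ablating it:       {true_effect:+.2f}")
```

In this toy the attribution and the ablation disagree because the later component's output depends on the earlier one, a dependency that a per-component projection onto the logits does not capture.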
