Localizing behaviors of neural networks to a subset of the network's
components or a subset of interactions between components is a natural first
step towards analyzing network mechanisms and possible failure modes. Existing
work is often qualitative and ad-hoc, and there is no consensus on the
appropriate way to evaluate localization claims. We introduce path patching, a
technique for expressing and quantitatively testing a natural class of
hypotheses stating that behaviors are localized to a set of paths. We refine
an explanation of induction heads, characterize a behavior of GPT-2, and open
source a framework for efficiently running similar experiments.

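Summary: This paper introduces a technique called "path patching" for
quantitatively testing claims that a neural network's behavior is localized to
particular paths, as a step towards analyzing network mechanisms and possible
failure modes. The paper refines an explanation of induction heads,
characterizes a behavior of GPT-2, and open sources a framework for running
similar experiments.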
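
To make the idea of testing a path-localization hypothesis concrete, here is a minimal sketch on a toy two-branch network. It is not the paper's framework: the weights, the inputs, and every name (`forward`, `W1`, `x_other`, and so on) are assumptions made up for illustration. The hypothesis "the output depends only on the path through h1" is tested by feeding the other path a different input and measuring how far the output moves.

```python
from typing import Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def matvec(m: Matrix, x: Vector) -> list:
    """Multiply matrix m by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


def relu(v: Vector) -> list:
    """Elementwise max(a, 0)."""
    return [max(a, 0.0) for a in v]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


# Hypothetical toy weights: two parallel paths, x -> h1 -> y and x -> h2 -> y.
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[1.0, 1.0], [1.0, -1.0]]
V1 = [1.0, 2.0]   # readout of the h1 path (large)
V2 = [0.1, 0.0]   # readout of the h2 path (small)


def forward(x_for_h1: Vector, x_for_h2: Vector) -> float:
    """Run the toy network, letting each path see its own input."""
    h1 = relu(matvec(W1, x_for_h1))
    h2 = relu(matvec(W2, x_for_h2))
    return dot(V1, h1) + dot(V2, h2)


x_clean = [1.0, 2.0]   # the input whose behavior we want to explain
x_other = [3.0, 0.5]   # a different input used for patching

baseline = forward(x_clean, x_clean)

# Hypothesis A: only the h1 path matters, so patch the h2 path.
gap_h1_hypothesis = abs(baseline - forward(x_clean, x_other))

# Hypothesis B: only the h2 path matters, so patch the h1 path.
gap_h2_hypothesis = abs(baseline - forward(x_other, x_clean))

print(f"gap when h2 path is patched: {gap_h1_hypothesis:.3f}")
print(f"gap when h1 path is patched: {gap_h2_hypothesis:.3f}")
```

A small gap means the output barely changed when the paths outside the hypothesis were given another input, which supports the hypothesis; a large gap counts against it. In this toy setup the first gap is much smaller than the second. A realistic experiment would average such a measurement over many input pairs rather than rely on one.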

Localizing Model Behavior with Path Patching

"Induction heads" are attention heads that implement a simple algorithm to
complete token sequences like [A][B] ... [A] -> [B]. In this work, we present
preliminary and indirect evidence for a hypothesis that induction heads might
constitute the mechanism for the majority of all "in-context learning" in large
transformer models (i.e. decreasing loss at increasing token indices). We find
that induction heads develop at precisely the same point as a sudden sharp
increase in in-context learning ability, visible as a bump in the training
loss. We present six complementary lines of evidence, arguing that induction
heads may be the mechanistic source of general in-context learning in
transformer models of any size. For small attention-only models, we present
strong, causal evidence; for larger models with MLPs, we present correlational
evidence.

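Summary: This paper presents six lines of evidence for the hypothesis that
"induction heads" may constitute the mechanism for most "in-context learning"
in large transformer models. It offers strong causal evidence for small
attention-only models and correlational evidence for larger models with MLPs,
arguing that induction heads may be the source of general in-context learning
in transformer models of any size.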
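
The completion rule [A][B] ... [A] -> [B] attributed to induction heads can be sketched as a plain lookup over the earlier context. This is only an illustration of the rule itself, not of how an attention head computes it; the function name `induction_predict` and the choice to use the most recent earlier occurrence are assumptions.

```python
from typing import Optional, Sequence


def induction_predict(tokens: Sequence[str]) -> Optional[str]:
    """Predict the next token with the rule [A][B] ... [A] -> [B].

    Finds the most recent earlier occurrence of the current (last) token
    and returns the token that followed it, or None if there is none.
    """
    if len(tokens) < 2:
        return None
    current = tokens[-1]
    # Scan earlier positions from right to left, skipping the last token itself.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None


context = ["A", "B", "C", "D", "A"]
print(induction_predict(context))
```

For the context above, the last token "A" appeared earlier followed by "B", so the rule proposes "B". When the current token has not appeared before, the function returns None, reflecting that the rule has nothing to copy from.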