Can we localize the weights and mechanisms used by a language model to
memorize and recite entire paragraphs of its training data? In this paper, we
show that while memorization is spread across multiple layers and model
components, gradients of memorized paragraphs have a distinguishable spatial
pattern, being larger in lower model layers than gradients of non-memorized
examples. Moreover, the memorized examples can be unlearned by fine-tuning only
the high-gradient weights. We localize a low-layer attention head that appears
to be especially involved in paragraph memorization. This head predominantly
focuses its attention on distinctive, rare tokens that are least frequent in a
corpus-level unigram distribution. Next, we study how localized memorization is
across the tokens in the prefix by perturbing tokens and measuring the resulting
change in the decoding. A few distinctive tokens early in a prefix can often
corrupt the entire continuation. Overall, memorized continuations are not only
harder to unlearn but also harder to corrupt than non-memorized ones.
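
The idea of updating only the high-gradient weights can be illustrated with a minimal NumPy sketch. Everything in it is an assumption made for illustration: the function names, the global (cross-layer) threshold, the toy gradient values, and the use of a plain gradient-ascent step as a stand-in for the unlearning objective, which the abstract does not specify.

```python
import numpy as np

def top_gradient_mask(grads, fraction):
    """Boolean masks selecting the `fraction` of weights with the largest |gradient|.

    `grads` maps a (hypothetical) parameter name to its gradient array.
    The threshold is global across all parameters, an assumption of this sketch.
    """
    flat = np.concatenate([np.abs(g).ravel() for g in grads.values()])
    k = max(1, int(round(fraction * flat.size)))
    kth = flat.size - k
    # The k-th largest absolute gradient becomes the cut-off.
    threshold = np.partition(flat, kth)[kth]
    return {name: np.abs(g) >= threshold for name, g in grads.items()}

def masked_ascent_step(params, grads, masks, lr):
    """Gradient ascent on the loss, changing only the masked weights."""
    return {name: params[name] + lr * grads[name] * masks[name] for name in params}

# Toy gradients: a "low" layer with large entries, a "high" layer with small ones.
grads = {
    "layer0": np.array([0.9, -0.1, 0.05, -0.8]),
    "layer5": np.array([0.02, 0.01, -0.03, 0.04]),
}
params = {name: np.zeros_like(g) for name, g in grads.items()}

masks = top_gradient_mask(grads, fraction=0.25)
updated = masked_ascent_step(params, grads, masks, lr=0.1)
print(masks["layer0"], updated["layer0"])
```

In this toy setup the two selected weights both fall in the low layer, so the high layer is left untouched by the update.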

Localizing Paragraph Memorization in Language Models

We present a single attention head in GPT-2 Small that has one main role
across the entire training distribution. If components in earlier layers
predict a certain token, and this token appears earlier in the context, the
head suppresses it: we call this copy suppression. Attention Head 10.7 (L10H7)
suppresses naive copying behavior, which improves overall model calibration.
This explains why multiple prior works studying certain narrow tasks found
negative heads that systematically favored the wrong answer. We uncover the
mechanism that the negative heads use for copy suppression with weights-based
evidence and are able to explain 76.9% of the impact of L10H7 in GPT-2 Small.
To the best of our knowledge, this is the most comprehensive description of the
complete role of a component in a language model to date. One major effect of
copy suppression is its role in self-repair. Self-repair refers to how ablating
crucial model components results in downstream neural network parts
compensating for this ablation. Copy suppression leads to self-repair: if an
initial overconfident copier is ablated, then there is nothing to suppress. We
show that self-repair is implemented by several mechanisms, one of which is
copy suppression, which explains 39% of the behavior in a narrow task.
Interactive visualisations of the copy suppression phenomena may be seen at our
web app this https URL
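
The link between copy suppression and self-repair can be made concrete with a toy NumPy sketch. It is a caricature of the behavior described above, not the head's actual computation: the function name, the `strength` parameter, the three-token vocabulary, and all logit values are hypothetical.

```python
import numpy as np

def copy_suppress(logits, context_ids, strength=0.5):
    """Toy copy suppression.

    Lowers the logit of every token that is both predicted by earlier
    components (positive logit here) and already present in the context.
    `strength` is a hypothetical scaling factor.
    """
    out = logits.astype(float)
    for tok in set(context_ids):
        if logits[tok] > 0:
            out[tok] -= strength * logits[tok]
    return out

context_ids = [1]                           # token 1 appeared earlier in the context
copier = np.array([0.0, 4.0, 0.0])          # an earlier "copier" pushes token 1
base = np.zeros(3)                          # all other components, kept at zero

full = copy_suppress(base + copier, context_ids)   # copier active
ablated = copy_suppress(base, context_ids)         # copier ablated: nothing to suppress

direct_effect = copier[1]                   # what the copier writes directly
total_effect = full[1] - ablated[1]         # what ablating it actually changes
print(direct_effect, total_effect)
```

With these made-up numbers, ablating the copier removes a direct contribution of 4 but changes the final logit by only 2, because the suppression term disappears along with the copier. That gap is the kind of compensation the abstract calls self-repair.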

Attention Head 10.7 (L10H7) in GPT-2 Small achieves model calibration and self-repair by suppressing copying behavior.