Can we localize the weights and mechanisms used by a language model to memorize and recite entire paragraphs of its training data? In this paper, we show that while memorization is spread across multiple layers and model components, gradients of memorized paragraphs have a distinguishable spatial pattern, being larger in lower model layers than gradients of non-memorized examples. Moreover, the memorized examples can be unlearned by fine-tuning only the high-gradient weights. We localize a low-layer attention head that appears to be especially involved in paragraph memorization. This head predominantly focuses its attention on distinctive, rare tokens that are least frequent in a corpus-level unigram distribution. Next, we study how localized memorization is across the tokens in the prefix by perturbing tokens and measuring the resulting change in the decoding. Perturbing a few distinctive tokens early in a prefix can often corrupt the entire continuation. Overall, memorized continuations are not only harder to unlearn, but also harder to corrupt than non-memorized ones.
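
The idea of fine-tuning only the high-gradient weights can be illustrated with a small sketch. The code below is not the paper's implementation: the function names `top_k_mask` and `sparse_step`, the dictionary-of-lists parameter layout, and the plain gradient step are all assumptions chosen for illustration. It selects the k gradient entries with the largest magnitude across all parameters and updates only those weights, leaving the rest frozen.

```python
def top_k_mask(grads, k):
    """Return a 0/1 mask selecting the k largest-magnitude gradient entries.

    `grads` maps a (hypothetical) parameter name to a flat list of floats.
    """
    # Collect (|gradient|, parameter name, index) for every scalar weight.
    entries = [
        (abs(g), name, i)
        for name, vec in grads.items()
        for i, g in enumerate(vec)
    ]
    # Sort by magnitude only, largest first; ties keep their original order.
    entries.sort(key=lambda e: e[0], reverse=True)

    mask = {name: [0] * len(vec) for name, vec in grads.items()}
    for _, name, i in entries[:k]:
        mask[name][i] = 1
    return mask


def sparse_step(weights, grads, mask, lr):
    """Apply one gradient step only where mask == 1; other weights stay frozen."""
    return {
        name: [
            w - lr * g * m
            for w, g, m in zip(weights[name], grads[name], mask[name])
        ]
        for name in weights
    }


# Toy example with two "layers" of two weights each.
weights = {"layer0": [1.0, 2.0], "layer1": [3.0, 4.0]}
grads = {"layer0": [0.5, -2.0], "layer1": [0.1, 1.0]}

mask = top_k_mask(grads, k=2)
updated = sparse_step(weights, grads, mask, lr=0.5)
```

In a real model the gradients would come from whatever unlearning objective is chosen, and the mask would typically be computed once from the memorized examples and then held fixed during fine-tuning; both choices are assumptions here rather than details taken from the abstract.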


Localizing Paragraph Memorization in Language Models