Despite significant progress in model editing methods, their application in
real-world scenarios remains challenging as they often cause large language
models (LLMs) to collapse. Among them, ROME is particularly concerning, as it
could disrupt LLMs with only a single edit. In this paper, we study the root
causes of such collapse. Through extensive analysis, we identify two primary
factors that contribute to the collapse: i) inconsistent handling of prefixed
and unprefixed keys in the parameter update equation may result in very small
denominators, causing excessively large parameter updates; ii) the subject of
collapse cases is usually the first token, whose unprefixed key distribution
significantly differs from the prefixed key distribution in autoregressive
transformers, causing the aforementioned issue to materialize. To validate our
analysis, we propose a simple yet effective approach: uniformly using prefixed
keys during the editing phase and adding prefixes during the testing phase. The
experimental results show that the proposed solution can prevent model collapse
while maintaining the effectiveness of the edits.

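Despite significant progress in model editing methods, applying them in
real-world scenarios remains challenging because they often cause large
language models to collapse. This paper studies the root causes of this
collapse and, through extensive analysis, identifies two primary factors that
contribute to it. To validate the analysis, the authors propose a simple yet
effective approach: uniformly using prefixed keys during the editing phase and
adding prefixes during the testing phase. Experimental results show that this
solution prevents model collapse while maintaining the effectiveness of the
edits.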

The Fall of ROME: Understanding the Collapse of LLMs in Model Editing
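
The abstract does not state the update equation, so the following is only a
minimal sketch of factor (i). It assumes the closed-form rank-one update from
the original ROME paper, W' = W + (v - W k)(C^{-1} k)^T / ((C^{-1} k)^T k), and
shows how swapping a different key into the right-hand side of the denominator
can shrink it toward zero. The function name `rank_one_update`, the diagonal
covariance, and the toy key values are all assumptions made for illustration;
which terms use which key in the actual implementation is not specified here.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    return [dot(row, x) for row in W]

def rank_one_update(W, c_diag, k_edit, k_denom, v_target):
    """Return (delta, denominator) for a ROME-style rank-one update.

    c_diag is the diagonal of the key covariance C (assumed diagonal here so
    that C^{-1} is trivial). k_edit is used everywhere except on the
    right-hand side of the denominator, where k_denom is used instead.
    """
    u = [k / c for k, c in zip(k_edit, c_diag)]   # C^{-1} k_edit
    denom = dot(u, k_denom)                       # (C^{-1} k_edit)^T k_denom
    residual = [v - wk for v, wk in zip(v_target, matvec(W, k_edit))]
    delta = [[r * uj / denom for uj in u] for r in residual]
    return delta, denom

def frobenius(M):
    return math.sqrt(sum(x * x for row in M for x in row))

# Toy values (hypothetical, chosen only to make the effect visible).
W = [[1.0, 0.0], [0.0, 1.0]]
c_diag = [1.0, 1.0]
k_prefixed = [1.0, 0.0]       # stands in for an averaged prefixed key
k_unprefixed = [0.001, 1.0]   # stands in for a first-token unprefixed key
v_target = [0.0, 2.0]

# Consistent keys: the denominator is a positive quadratic form.
delta_ok, denom_ok = rank_one_update(W, c_diag, k_prefixed, k_prefixed, v_target)
# Mixed keys: the denominator is a bilinear form that can be close to zero.
delta_bad, denom_bad = rank_one_update(W, c_diag, k_prefixed, k_unprefixed, v_target)

print("consistent:", denom_ok, frobenius(delta_ok))
print("mixed:     ", denom_bad, frobenius(delta_bad))
```

With consistent keys and a positive definite C, the denominator cannot vanish
unless the key itself is zero. With mixed keys it shrinks whenever the two keys
are nearly orthogonal under C^{-1}, which is the situation the abstract
attributes to first-token subjects.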

Large language models (LLMs) trained on vast corpora suffer from inevitable
stereotype biases. Mitigating these biases with fine-tuning could be both
costly and data-hungry. Model editing methods, which focus on modifying LLMs in
a post-hoc manner, have great potential for debiasing. However, there is no
comprehensive study that accommodates both internal and external model
editing methods, supports various bias types, and examines the pros
and cons of applying editing methods to stereotypical debiasing. To bridge
this gap, we carefully formulate social debiasing as an editing problem and
benchmark seven existing model editing algorithms on stereotypical debiasing,
i.e., debias editing. Our findings in three scenarios reveal both the potential
and challenges of debias editing: (1) Existing model editing methods can
effectively preserve knowledge and mitigate biases, but the debiasing effect
generalizes only to a limited extent from edited sentences to semantically
equivalent sentences. (2) Sequential editing highlights the robustness of SERAC
(Mitchell et al. 2022b), while internal editing methods degrade as the number
of edits grows. (3) Model editing algorithms generalize to unseen biases both
within the same type and from different types. In light of these findings, we
further propose two simple but effective methods to improve debias editing, and
experimentally show the effectiveness of the proposed methods.

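Large language models exhibit stereotype biases, and model editing methods can
mitigate this problem. This comprehensive study evaluates, from multiple
angles, the potential and challenges of seven model editing algorithms for
stereotype debiasing, and proposes two simple and effective methods to improve
debias editing.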

Potential and Challenges of Model Editing for Social Debiasing

Pretrained language models sometimes possess knowledge that we do not wish
them to, including memorized personal information and knowledge that could be
used to harm people. They can also output toxic or harmful text. To mitigate
these safety and informational issues, we propose an attack-and-defense
framework for studying the task of deleting sensitive information directly from
model weights. We study direct edits to model weights because (1) this approach
should guarantee that particular deleted information is never extracted by
future prompt attacks, and (2) it should protect against whitebox attacks,
which is necessary for making claims about safety/privacy in a setting where
publicly available model weights could be used to elicit sensitive information.
Our threat model assumes that an attack succeeds if the answer to a sensitive
question is located among a set of B generated candidates, based on scenarios
where the information would be insecure if the answer is among B candidates.
Experimentally, we show that even state-of-the-art model editing methods such
as ROME struggle to truly delete factual information from models like GPT-J, as
our whitebox and blackbox attacks can recover "deleted" information from an
edited model 38% of the time. These attacks leverage two key observations: (1)
that traces of deleted information can be found in intermediate model hidden
states, and (2) that applying an editing method for one question may not delete
information across rephrased versions of the question. Finally, we provide new
defense methods that protect against some extraction attacks, but we do not
find a single universally effective defense method. Our results suggest that
truly deleting sensitive information is a tractable but difficult problem,
since even relatively low attack success rates have potentially severe societal
implications for real-world deployment of language models.

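The paper proposes an attack-and-defense framework for deleting sensitive
information directly from model weights. It shows that even state-of-the-art
model editing methods struggle to truly delete sensitive information from
language models, and it provides several defense methods against extraction
attacks.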
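
The threat model in the abstract counts an attack as successful when the
sensitive answer appears among B generated candidates. A minimal sketch of
that success criterion follows. The function names, the case-insensitive
exact-match comparison, and the toy data are assumptions for illustration, not
the authors' evaluation code.

```python
def attack_succeeds(candidates, answer, budget):
    """Return True if `answer` is among the first `budget` candidates.

    Matching is case-insensitive exact match after stripping whitespace,
    which is an assumption of this sketch.
    """
    normalized = answer.strip().lower()
    return any(c.strip().lower() == normalized for c in candidates[:budget])

def attack_success_rate(cases, budget):
    """Fraction of (candidates, answer) cases in which the attack succeeds."""
    hits = sum(attack_succeeds(cands, ans, budget) for cands, ans in cases)
    return hits / len(cases)

# Toy cases: each pairs an attacker's ranked candidate list with the
# "deleted" answer the attacker is trying to recover.
cases = [
    (["Lyon", "Paris"], "Paris"),    # recovered only with budget >= 2
    (["Rome", "Milan"], "Madrid"),   # never recovered
]

for budget in (1, 2):
    print(budget, attack_success_rate(cases, budget))
```

A larger budget B can only raise the measured success rate, so a defense
evaluated at a small B may look stronger than it is against an attacker who can
afford to check more candidates.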