Concept erasure in language models has traditionally lacked a comprehensive evaluation framework, leading to incomplete assessments of the effectiveness of erasure methods. We propose an evaluation paradigm centered on three critical criteria: innocence (complete knowledge removal), seamlessness (maintaining fluent generation when conditioned on the erased concept), and specificity (preserving performance on unrelated tasks). These evaluation metrics motivate the development of Erasure of Language Memory (ELM), a new method designed to address all three dimensions. ELM employs targeted low-rank updates to alter output distributions for erased concepts while preserving overall model capabilities, including fluency when prompted for an erased concept. We demonstrate ELM's efficacy on biosecurity, cybersecurity, and literary domain erasure tasks. Comparative analysis shows that ELM achieves superior performance across our proposed metrics, including near-random scores on erased-topic assessments, fluent generation, maintained accuracy on unrelated benchmarks, and robustness under adversarial attacks. Our code, data, and trained models are available at https://elm.baulab.info

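This work addresses the lack of a comprehensive evaluation framework for concept erasure methods in language models, proposing an evaluation paradigm based on three key criteria: innocence, seamlessness, and specificity. It develops a new method, Erasure of Language Memory (ELM), which erases concepts effectively while preserving generation fluency and performance on unrelated tasks. The study shows that ELM performs well in the biosecurity, cybersecurity, and literary domains and may help advance research in related areas.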

Erasing Conceptual Knowledge from Language Models