Large language models (LLMs) excel in various capabilities but also pose
safety risks such as generating harmful content and misinformation, even after
safety alignment. In this paper, we explore the inner mechanisms of safety
alignment from the perspective of mechanistic interpretability, focusing on
identifying and analyzing safety neurons within LLMs that are responsible for
safety behaviors. We propose generation-time activation contrasting to locate
these neurons and dynamic activation patching to evaluate their causal effects.
Experiments on multiple recent LLMs show that: (1) Safety neurons are sparse
and effective. We can restore $90$% safety performance with intervention only
on about $5$% of all the neurons. (2) Safety neurons encode transferrable
mechanisms. They exhibit consistent effectiveness on different red-teaming
datasets. The finding of safety neurons also interprets "alignment tax". We
observe that the identified key neurons for safety and helpfulness
significantly overlap, but they require different activation patterns of the
shared neurons. Furthermore, we demonstrate an application of safety neurons in
detecting unsafe outputs before generation. Our findings may promote further
research on understanding LLM alignment. The source codes will be publicly
released to facilitate future research.

我们通过从机理解释的角度探索安全对齐的内在机制，重点是识别和分析大语言模型中负责安全行为的安全神经元。我们提出了生成时激活对比来定位这些神经元，并提出了动态激活修复来评估其因果效应。多个最新的大语言模型的实验证明：（1）安全神经元是稀疏而有效的。只通过对大约 5％的神经元进行干预，我们可以恢复 90％的安全性能。 （2）安全神经元编码可转移的机制。它们在不同的红队测试数据集上表现出一致的有效性。安全神经元的发现还解释了 “对齐税” 的现象。我们观察到，安全和有用的关键神经元明显重叠，但它们对共享神经元的激活模式有不同要求。此外，我们展示了在生成之前使用安全神经元检测不安全输出的应用。我们的发现可能促进进一步研究理解大语言模型的对齐。源代码将公开发布以促进未来的研究。