Safety fine-tuning helps align Large Language Models (LLMs) with human preferences so that they can be deployed safely. To better understand the underlying factors that make models safe via safety fine-tuning, we design a synthetic data generation framework that captures salient aspects of an unsafe input by modeling the interaction between the task the model is asked to perform (e.g., ``design'') and the specific concept the task is to be performed on (e.g., a ``cycle'' vs. a ``bomb''). Using this framework, we investigate three well-known safety fine-tuning methods -- supervised safety fine-tuning, direct preference optimization, and unlearning -- and provide significant evidence that these methods minimally transform MLP weights so as to align unsafe inputs specifically with the null space of those weights. This yields a clustering of inputs based on whether the model deems them safe or not. Correspondingly, when an adversarial input (e.g., a jailbreak) is provided, its activations are closer to those of safe samples, leading the model to process such an input as if it were safe. We validate our findings, wherever possible, on real-world models -- specifically, Llama-2 7B and Llama-3 8B.

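By designing a synthetic data generation framework, the study examines three common safety fine-tuning methods: supervised safety fine-tuning, direct preference optimization, and unlearning. These methods minimally transform multilayer perceptron (MLP) weights so that unsafe inputs align with the null space of the weights, which in turn clusters inputs according to whether the model deems them safe. The study also validates these findings on real-world models (Llama-2 7B and Llama-3 8B).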
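The null-space claim can be made concrete with a toy linear layer. The sketch below is only one possible reading of the abstract, not the paper's experimental setup: the single "unsafe" direction `u`, the rank-1 update `dW`, and the `update_response` score are all assumptions introduced here for illustration. It shows that the smallest (Frobenius-norm) weight change that places `u` in the null space of the updated weights leaves inputs orthogonal to `u` untouched, and that an input carrying only a small component along `u` is affected far less than a plainly unsafe one.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 8, 4

# Toy weights standing in for an MLP layer before safety fine-tuning.
W = rng.normal(size=(d_out, d_in))

# Hypothetical unit direction along which "unsafe" activations lie (an assumption).
u = rng.normal(size=d_in)
u /= np.linalg.norm(u)

# Minimum-Frobenius-norm update satisfying (W + dW) @ u == 0.
# It is rank-1, so it changes W only along the unsafe direction.
dW = -np.outer(W @ u, u)
W_ft = W + dW


def orthogonal_part(x):
    """Remove the component of x along u."""
    return x - (x @ u) * u


def update_response(x):
    """How strongly the weight update acts on input x (illustrative score)."""
    return float(np.linalg.norm(dW @ x))


safe = orthogonal_part(rng.normal(size=d_in))   # no unsafe component
unsafe = 3.0 * u                                # purely along the unsafe direction
jailbreak = safe + 0.2 * unsafe                 # unsafe content, mostly "safe-looking"

for name, x in [("safe", safe), ("jailbreak", jailbreak), ("unsafe", unsafe)]:
    print(f"{name:10s} update response = {update_response(x):.4f}")
```

In this toy setting the three scores separate the inputs into groups, with the jailbreak-like input landing much nearer the safe one than the unsafe one, which mirrors the qualitative behaviour the abstract describes. Real MLP layers are nonlinear and the unsafe inputs need not share a single direction, so this should be read as intuition only.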

A Causal Study of Safety Fine-Tuning: Successes and Obstacles