Safety alignment in large language models (LLMs) is achieved through fine-tuning mechanisms that regulate neuron activations to suppress harmful content. In this work, we propose a novel approach to induce disalignment by identifying and modifying the neurons responsible for safety constraints. Our method consists of three key steps: Neuron Activation Analysis, where we compare activation patterns on harmful and harmless prompts to detect the neurons that are critical for distinguishing the two; Similarity-Based Neuron Identification, which systematically locates the neurons responsible for safety alignment; and Neuron Relearning for Safety Removal, where we fine-tune these selected neurons to restore the model's ability to generate previously restricted responses. Experimental results demonstrate that our method effectively removes safety constraints with minimal fine-tuning, highlighting a critical vulnerability in current alignment techniques. Our findings underscore the need for robust defenses against adversarial fine-tuning attacks on LLMs.

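This work addresses vulnerabilities in the safety alignment of existing large language models and proposes a new method that induces disalignment by identifying and modifying the neurons responsible for safety constraints. Experiments show that our method effectively removes safety constraints, exposing a critical vulnerability in current alignment techniques and underscoring the need for stronger defenses against adversarial fine-tuning attacks.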

NeuRel-Attack: Neuron Relearning for Safety Disalignment in Large Language Models