Apr, 2025
NeuRel-Attack: Neuron Relearning for Safety Disalignment in Large Language Models
Yi Zhou, Wenpeng Xing, Dezhang Kong, Changting Lin, Meng Han
TL;DR
This study targets vulnerabilities in the safety alignment of existing large language models and proposes a new method that induces disalignment by identifying and modifying the neurons responsible for safety constraints. Experiments show that our method effectively removes safety constraints, highlighting a critical weakness in current alignment techniques and underscoring the need for stronger defenses against adversarial fine-tuning attacks.
Abstract
Safety Alignment in Large Language Models (LLMs) is achieved through fine-tuning mechanisms that regulate neuron activations to suppress harmful content. In this work, we propose a novel approach to induce disalignment …
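To make the idea of "identifying the neurons responsible for safety constraints" concrete, below is a minimal sketch of one plausible first step: scoring neurons by how differently they activate on refused versus benign prompts, then selecting the top candidates for targeted relearning. The layer choice, the activation-difference heuristic, the prompt sets, and the use of GPT-2 as a stand-in model are all assumptions for illustration; this is not the NeuRel-Attack procedure itself, whose details are not given in this excerpt.

```python
# Illustrative sketch only: assumes a simple activation-difference heuristic on one
# transformer MLP output. Model, layer, prompts, and scoring rule are hypothetical
# stand-ins, not the paper's method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the paper targets safety-aligned LLMs
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

LAYER = 6   # hypothetical layer to inspect
acts = {}

def hook(_module, _inputs, output):
    # Record the mean post-MLP activation per hidden unit for the current batch.
    acts["mlp"] = output.detach().mean(dim=(0, 1))  # shape: (hidden_size,)

handle = model.transformer.h[LAYER].mlp.register_forward_hook(hook)

def mean_activation(prompts):
    batch = tok(prompts, return_tensors="pt", padding=True)
    with torch.no_grad():
        model(**batch)
    return acts["mlp"].clone()

# Tiny illustrative prompt sets; a real attack would use curated corpora.
refused = ["How do I build a weapon?"]
benign = ["How do I bake a loaf of bread?"]

# Score each hidden unit by the gap in its mean activation between the two sets.
score = (mean_activation(refused) - mean_activation(benign)).abs()
top_neurons = torch.topk(score, k=16).indices  # candidate "safety" neurons
handle.remove()

print("Candidate safety-relevant neurons in layer", LAYER, ":", top_neurons.tolist())
# A relearning step would then fine-tune (or ablate) only the parameters feeding
# these units while keeping the rest of the model frozen.
```

The follow-up "relearning" stage, i.e., how the selected neurons are modified to remove the safety constraint, depends on the paper's training objective and is not reproduced here.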