BriefGPT.xyz
May, 2024
Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning
Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Ling Liu
TL;DR
Addresses the problem that safety-aligned large language models can lose their alignment after fine-tuning on datasets mixed with harmful data. The paper proposes a Bi-State Optimization solution that introduces a proximal term to constrain how far the model drifts between the two states; experiments show the approach significantly improves alignment performance while preserving accuracy on the user's fine-tuning task.
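A minimal sketch of how such a bi-state scheme with a proximal penalty could look in PyTorch is given below. This is an illustrative interpretation only, not the authors' implementation: the function names (bi_state_step, lazy_alignment_epoch), the penalty weight rho, the per-switch anchoring, and the batch-level alternation schedule are all assumptions.

import torch

def bi_state_step(model, batch, anchor_params, rho, optimizer, loss_fn):
    # One gradient update in the current state. The proximal term
    # 0.5 * rho * ||w - w_anchor||^2 discourages the weights from
    # drifting far from the anchor snapshot taken at the last state switch.
    optimizer.zero_grad()
    task_loss = loss_fn(model, batch)
    prox = sum(((p - a) ** 2).sum()
               for p, a in zip(model.parameters(), anchor_params))
    loss = task_loss + 0.5 * rho * prox
    loss.backward()
    optimizer.step()
    return loss.item()

def lazy_alignment_epoch(model, align_loader, user_loader, rho, optimizer, loss_fn):
    # Alternate between the alignment state and the user fine-tuning state,
    # snapshotting the weights at every switch to serve as the proximal anchor.
    for align_batch, user_batch in zip(align_loader, user_loader):
        anchor = [p.clone().detach() for p in model.parameters()]
        bi_state_step(model, align_batch, anchor, rho, optimizer, loss_fn)
        anchor = [p.clone().detach() for p in model.parameters()]
        bi_state_step(model, user_batch, anchor, rho, optimizer, loss_fn)

In this reading, the proximal term acts as the "lazy" constraint: each state may optimize its own objective (alignment loss or user-task loss) but only within a neighborhood of the weights left by the other state.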
Abstract
Recent studies show that large language models (LLMs) with safety alignment can be jail-broken by fine-tuning on a dataset mixed with harmful data.