BriefGPT.xyz
May, 2024
A safety realignment framework via subspace-oriented model fusion for large language models
Xin Yi, Shunfan Zheng, Linlin Wang, Xiaoling Wang, Liang He
TL;DR
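This work proposes a safety realignment framework based on subspace-oriented model fusion (SOMF). It aims to combine the safety capabilities of the initially aligned model and the current fine-tuned model into a single realigned model, and it verifies that the framework preserves safety without noticeably degrading downstream task performance.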
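To make the idea of fusing an aligned model with a fine-tuned one concrete, here is a minimal sketch, not the paper's actual procedure. It assumes model weights are flat lists of floats and that a binary `safety_mask` marking safety-relevant weights is already available; this summary does not say how SOMF identifies that subspace, so the mask, the function name `fuse_weights`, and the toy values are all hypothetical.

```python
from typing import List


def fuse_weights(
    aligned: List[float],
    finetuned: List[float],
    safety_mask: List[int],
) -> List[float]:
    """Fuse two weight vectors element-wise (illustrative assumption only).

    Where safety_mask is 1, the initially aligned weight is kept.
    Where safety_mask is 0, the fine-tuned weight is taken.
    """
    if not (len(aligned) == len(finetuned) == len(safety_mask)):
        raise ValueError("all inputs must have the same length")

    fused = []
    for a, f, m in zip(aligned, finetuned, safety_mask):
        # Change introduced by fine-tuning on the downstream task.
        task_delta = f - a
        # Apply the task change only outside the assumed safety subspace.
        fused.append(a + (1 - m) * task_delta)
    return fused


# Toy values, chosen only for illustration.
aligned = [0.0, 1.0, 2.0, 3.0]
finetuned = [0.5, 1.5, 1.0, 3.0]
safety_mask = [1, 0, 0, 1]  # hypothetical: first and last weights are safety-relevant

fused = fuse_weights(aligned, finetuned, safety_mask)
print(fused)
```

In this toy setup the masked positions revert to the aligned model while the remaining positions keep the fine-tuned values, which mirrors the stated goal of retaining safety without giving up downstream task performance.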
Abstract
The current safeguard mechanisms for large language models (LLMs) are indeed susceptible to jailbreak attacks, making them inherently fragile. Even the process of