BriefGPT.xyz
Feb, 2024
Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic
Rishabh Bhardwaj, Do Duc Anh, Soujanya Poria
TL;DR
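Using simple arithmetic, namely adding a safety vector to the weights of a compromised model, our proposed LLM safety re-alignment method, RESTA, effectively reduces the harmfulness of the compromised model while preserving most of its task performance.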
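The weight addition described in the TL;DR can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `add_safety_vector`, the `scale` coefficient, and the representation of weights as plain dictionaries of float lists are all assumptions made for the example, and the safety vector is taken as given rather than derived.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flat list of weights.
Weights = Dict[str, List[float]]


def add_safety_vector(
    compromised: Weights,
    safety_vector: Weights,
    scale: float = 1.0,  # assumed scaling coefficient, not taken from the source
) -> Weights:
    """Return new weights equal to compromised + scale * safety_vector."""
    if compromised.keys() != safety_vector.keys():
        raise ValueError("model and safety vector must have the same parameters")

    realigned: Weights = {}
    for name, values in compromised.items():
        delta = safety_vector[name]
        if len(values) != len(delta):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise addition: this is the whole "task arithmetic" step.
        realigned[name] = [w + scale * d for w, d in zip(values, delta)]
    return realigned


if __name__ == "__main__":
    model = {"layer.weight": [1.0, 2.0]}
    safety = {"layer.weight": [0.5, -1.0]}
    print(add_safety_vector(model, safety))
```

In practice the same element-wise addition would be applied to every tensor in a model's state dictionary; the plain-list version here only keeps the sketch free of external dependencies.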
Abstract
Aligned language models face a significant limitation as their fine-tuning often results in compromised safety. To tackle this, we propose
→