The effect of fine-tuning on language model toxicity

Oct, 2024

Will Hawkins, Brent Mittelstadt, Chris Russell

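TL;DR: This study addresses the safety concerns that fine-tuning language models can raise, examining how it affects the propensity of different open models to generate toxic content. Across three experiments on Gemma, Llama, and Phi models, we find that a small amount of parameter-efficient fine-tuning can significantly change a model's toxicity, and we show that models fine-tuned by community contributors can behave unpredictably in practical use.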

Abstract

Fine-tuning language models has become increasingly popular following the proliferation of open models and improvements in cost-effective parameter-efficient →