Jul, 2023
Jailbroken: How Does LLM Safety Training Fail?
Alexander Wei, Nika Haghtalab, Jacob Steinhardt
TL;DR
This paper studies safety failures in large language models and identifies two failure modes of safety training: competing objectives and mismatched generalization. The authors find that these vulnerabilities persist despite red teaming and safety training, and argue that safety mechanisms must be as sophisticated as the capabilities of the underlying model.
Abstract
Large language models trained for safety and harmlessness remain susceptible to adversarial misuse, as evidenced by the prevalence of "jailbreak" attacks on early releases of ChatGPT that elicit undesired behavior.