BriefGPT.xyz
Jul, 2024
Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks
Zhexin Zhang, Junxiao Yang, Pei Ke, Shiyao Cui, Chujie Zheng...
TL;DR
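Directly unlearning harmful knowledge in an LLM is an effective defense against jailbreak attacks. Experiments show that it generalizes surprisingly well, reducing the attack success rate from 82.6% to 7.7%.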
Abstract
LLMs are known to be vulnerable to jailbreak attacks, even after safety alignment. An important observation is that, while different types of jai