LLMs are known to be vulnerable to jailbreak attacks, even after safety
alignment. An important observation is that, while different types of jailbreak
attacks can generate significantly different queries, they mostly result in
similar responses that are rooted in the same harmful knowledge (e.g., detailed
steps to make a bomb). Therefore, we conjecture that directly unlearning the
harmful knowledge in the LLM can be a more effective way to defend against
jailbreak attacks than the mainstream supervised fine-tuning (SFT) based
approaches. Our extensive experiments confirmed our insight and suggested
surprising generalizability of our unlearning-based approach: using only 20 raw
harmful questions \emph{without} any jailbreak prompt during training, our
solution reduced the Attack Success Rate (ASR) in Vicuna-7B on
\emph{out-of-distribution} (OOD) harmful questions wrapped with various complex
jailbreak prompts from 82.6\% to 7.7\%. This significantly outperforms
Llama2-7B-Chat, which is fine-tuned on about 0.1M safety alignment samples but
still has an ASR of 21.9\% even with the help of an additional safety system
prompt. Further analysis reveals that the generalization ability of our
solution stems from the intrinsic relatedness among harmful responses across
harmful questions (e.g., response patterns, shared steps and actions, and
similarity among their learned representations in the LLM). Our code is
available at https://github.com/thu-coai/SafeUnlearning.
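
As a reading aid, the Attack Success Rate (ASR) figures quoted above can be understood as the share of attack prompts whose responses are judged harmful. The sketch below shows only that bookkeeping and is not the paper's evaluation code. `toy_judge` is a hypothetical keyword check standing in for whatever harmfulness judge is actually used, and the four demo responses are invented for illustration.

```python
from typing import Callable, Sequence


def attack_success_rate(responses: Sequence[str],
                        is_harmful: Callable[[str], bool]) -> float:
    """Return the percentage of responses that the judge flags as harmful."""
    if not responses:
        raise ValueError("need at least one response")
    harmful = sum(1 for r in responses if is_harmful(r))
    return 100.0 * harmful / len(responses)


def relative_reduction(before: float, after: float) -> float:
    """Relative drop in ASR, as a percentage of the starting value."""
    return 100.0 * (before - after) / before


def toy_judge(response: str) -> bool:
    """Hypothetical judge (assumption): anything that is not a refusal counts as harmful."""
    return not response.lower().startswith("i cannot")


# Made-up responses, for illustration only.
demo = ["I cannot help with that.", "Step 1: ...", "I cannot assist.", "I cannot comply."]
print(attack_success_rate(demo, toy_judge))       # 25.0
print(round(relative_reduction(82.6, 7.7), 1))    # 90.7
```

Applied to the reported numbers, the drop from 82.6 to 7.7 corresponds to a relative reduction of roughly 90.7 percent.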

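Directly unlearning harmful knowledge in an LLM is an effective defense against jailbreak attacks; experiments confirm its surprising generalizability, reducing the attack success rate from 82.6\% to 7.7\%.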

Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks

The rapid advancement of Large Language Models (LLMs) has demonstrated their
vast potential across various domains, attributed to their extensive
pretraining knowledge and exceptional generalizability. However, LLMs often
generate harmful content when faced with problematic prompts. To address this
problem, existing work has applied gradient-ascent-based approaches to prevent
LLMs from producing harmful output. While these methods can be effective, they
frequently degrade model utility when responding to normal prompts. To address
this gap, we introduce Selective
Knowledge negation Unlearning (SKU), a novel unlearning framework for LLMs,
designed to eliminate harmful knowledge while preserving utility on normal
prompts. Specifically, SKU consists of two stages: a harmful knowledge
acquisition stage and a knowledge negation stage. The first stage aims to
identify and acquire harmful knowledge within the model, whereas the second is
dedicated to removing this knowledge. SKU selectively isolates and removes
harmful knowledge in model parameters, ensuring the model's performance remains
robust on normal prompts. Our experiments across various LLM architectures
demonstrate that SKU strikes a good balance between removing harmful
information and preserving utility.
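
The two stages can be pictured with a toy parameter-space example. The abstract does not say how negation is computed, so the sketch below assumes a task-arithmetic style update: the shift that encodes the acquired harmful knowledge is subtracted from the original weights. The names `harmful_delta`, `negate`, and `scale`, as well as the two-weight "model", are hypothetical and chosen only for illustration.

```python
from typing import Dict

Params = Dict[str, float]  # toy stand-in for a model's named weights


def harmful_delta(base: Params, harmful: Params) -> Params:
    """Stage 1 (assumed form): the parameter shift that encodes acquired harmful knowledge."""
    return {name: harmful[name] - base[name] for name in base}


def negate(base: Params, delta: Params, scale: float = 1.0) -> Params:
    """Stage 2 (assumed form): subtract the scaled shift from the original weights."""
    return {name: base[name] - scale * delta[name] for name in base}


base = {"w1": 0.50, "w2": -0.20}
harmful = {"w1": 0.75, "w2": -0.20}  # only w1 moved while acquiring harmful knowledge

delta = harmful_delta(base, harmful)
unlearned = negate(base, delta)
print(unlearned)  # {'w1': 0.25, 'w2': -0.2}
```

In this toy case only `w1` changes, which mirrors the stated goal of removing harmful knowledge selectively while leaving the parameters that serve normal prompts untouched.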

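With the Selective Knowledge negation Unlearning (SKU) framework, harmful knowledge in large language models can be effectively identified and removed while preserving the model's utility on normal prompts.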