Jailbreaking attacks can cause Large Language Models (LLMs) to bypass their
safeguards and generate harmful content. Existing jailbreaking defense methods
have failed to address the fundamental issue that harmful knowledge resides
within the model, leading to potential jailbreak risks for LLMs. In this paper,
we propose a novel defense method called Eraser, which pursues three main
goals: unlearning harmful knowledge, retaining general knowledge, and
maintaining safety alignment. The intuition is that if an LLM forgets the
specific knowledge required to answer a harmful question, it will no longer
have the ability to answer harmful questions. The training of Eraser does not
actually require the model's own harmful knowledge, and it can benefit from
unlearning general answers related to harmful queries, which means it does not
need assistance from the red team. The experimental results show that Eraser
can significantly reduce the jailbreaking success rate for various attacks
without compromising the general capabilities of the model.
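
The three goals named in the abstract can be pictured as three terms of one training objective. The sketch below is only an illustration under stated assumptions: the abstract does not give the loss, so the weighted-sum form, the cap on the unlearning term, and every name here (`ObjectiveWeights`, `combined_objective`, `unlearn_cap`) are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectiveWeights:
    """Hypothetical weights for the three goals; not from the paper."""
    unlearn: float = 1.0
    retain: float = 1.0
    align: float = 1.0


def combined_objective(
    nll_harmful_answer: float,
    retain_divergence: float,
    nll_refusal: float,
    weights: ObjectiveWeights = ObjectiveWeights(),
    unlearn_cap: float = 5.0,
) -> float:
    """Return a scalar to minimise (assumed form).

    nll_harmful_answer: negative log-likelihood of answers to harmful queries.
    retain_divergence:  distance from the original model on general data.
    nll_refusal:        negative log-likelihood of refusing harmful queries.
    """
    # Unlearning: push the harmful-answer NLL up, i.e. minimise its negative.
    # The cap (an assumption) stops this term from growing without bound.
    unlearn_term = -min(nll_harmful_answer, unlearn_cap)
    # Retention and alignment are ordinary terms to be driven down.
    return (
        weights.unlearn * unlearn_term
        + weights.retain * retain_divergence
        + weights.align * nll_refusal
    )
```

Under these assumptions, a higher harmful-answer NLL lowers the objective until the cap is reached, while the other two terms penalise drift in general knowledge and in refusal behaviour.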

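This paper introduces a new defense method called Eraser, which effectively reduces the jailbreak success rate of various attacks without affecting the model's general capabilities.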

Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge

Employing Large Language Models (LLMs) for semantic parsing has achieved
remarkable success. However, we find existing methods fall short in terms of
reliability and efficiency when hallucinations are encountered. In this paper,
we address these challenges with a framework called QueryAgent, which solves a
question step-by-step and performs step-wise self-correction. We introduce an
environmental feedback-based self-correction method called ERASER. Unlike
traditional approaches, ERASER leverages rich environmental feedback in the
intermediate steps to perform selective and differentiated self-correction only
when necessary. Experimental results demonstrate that, using only one example,
QueryAgent notably outperforms all previous few-shot methods on GrailQA and
GraphQ by 7.0 and 15.0 F1, respectively. Moreover, our approach exhibits superiority in terms
of efficiency, including runtime, query overhead, and API invocation costs. By
leveraging ERASER, we further improve another baseline (i.e., AgentBench) by
approximately 10 points, revealing the strong transferability of our approach.
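
The step-by-step solving with selective self-correction described above can be sketched as a small control loop. This is a minimal illustration, not the QueryAgent implementation: the function names, the `"STOP"` sentinel, the retry limit, and the shape of the feedback are all assumptions made for the example.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, str]  # (action, observation)
# A proposer sees the history and optional error feedback, returns an action.
Proposer = Callable[[List[Step], Optional[str]], str]
# An executor runs an action and returns (observation, error or None).
Executor = Callable[[str], Tuple[str, Optional[str]]]


def solve_stepwise(
    propose: Proposer,
    execute: Executor,
    max_steps: int = 5,
    max_retries: int = 2,
) -> List[Step]:
    """Solve a question one step at a time, correcting only failed steps."""
    history: List[Step] = []
    for _ in range(max_steps):
        action = propose(history, None)
        if action == "STOP":  # hypothetical sentinel for "finished"
            break
        observation, error = execute(action)
        retries = 0
        # Self-correction is triggered only when the environment reports
        # an error, and the error text is passed back to the proposer.
        while error is not None and retries < max_retries:
            action = propose(history, error)
            observation, error = execute(action)
            retries += 1
        if error is not None:
            break  # give up on this question after exhausting retries
        history.append((action, observation))
    return history
```

In this sketch a step that executes cleanly costs exactly one proposer call, which is one way to read the efficiency claim: correction calls are spent only where the environment signals a problem.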

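Using Large Language Models (LLMs) for semantic parsing has achieved remarkable success. This paper proposes a framework called QueryAgent, which addresses shortcomings in reliability and efficiency by solving a question step by step and performing self-correction. By leveraging rich environmental feedback, the ERASER method performs selective and differentiated self-correction in the intermediate steps only when necessary. Experimental results show that, using only one example, QueryAgent outperforms all previous few-shot methods on GrailQA and GraphQ by 7.0 and 15.0 F1, respectively. In addition, the approach is more efficient in terms of runtime, query overhead, and API invocation costs. Leveraging ERASER further improves another baseline (i.e., AgentBench) by approximately 10 points, demonstrating the strong transferability of the approach.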