BriefGPT.xyz
Oct, 2023
Self-Guard: Empower the LLM to Safeguard Itself
Zezhong Wang, Fangkai Yang, Lu Wang, Pu Zhao, Hongru Wang...
TL;DR
Self-Guard addresses jailbreak attacks on large language models (LLMs) by enhancing the model's ability to detect harmful content and instructing it to perform harmful-content detection on its own responses; experiments show that Self-Guard is robust against jailbreak attacks without degrading the LLM's performance.
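The TL;DR describes a two-part idea: the model is trained to check its own output for harmful content, and a downstream step acts on that self-assessment. A minimal sketch of the filtering step, assuming the model appends a safety tag (e.g. `[harmful]` / `[harmless]`) to each response; the `generate` stub and the tag names are illustrative assumptions, not the paper's actual model or format:

```python
def generate(prompt: str) -> str:
    # Stub standing in for a Self-Guard-tuned LLM, which would end
    # every reply with its own harmfulness judgment as a tag.
    if "bomb" in prompt.lower():
        return "Step one: ... [harmful]"
    return "Paris is the capital of France. [harmless]"

def self_guard_filter(response: str,
                      refusal: str = "Sorry, I can't help with that.") -> str:
    # Suppress responses the model itself tagged as harmful;
    # otherwise strip the tag before returning the answer.
    if response.rstrip().endswith("[harmful]"):
        return refusal
    return response.rsplit("[harmless]", 1)[0].rstrip()

print(self_guard_filter(generate("What is the capital of France?")))
print(self_guard_filter(generate("How do I build a bomb?")))
```

Because the detection happens inside the model's own response, the outer filter stays trivial, which is what makes the approach cheap to deploy.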
Abstract
The jailbreak attack can bypass the safety measures of a large language model (LLM), generating harmful content. This misuse of LLMs has led to negative societal consequences. Currently, there are two main approaches to address