Feb 2024
LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper
Daoyuan Wu, Shuai Wang, Yang Liu, Ning Liu
TL;DR: Jailbreaking is an emerging adversarial attack that bypasses the safety alignment deployed in off-the-shelf large language models (LLMs). This paper proposes SELFDEFEND, a lightweight yet practical defense that can defend against all existing jailbreak attacks, incurring minimal delay for jailbreak prompts and negligible delay for normal user prompts.