February 2024

LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper

TL;DR: Jailbreaking is an emerging adversarial attack that bypasses the safety alignment deployed in off-the-shelf large language models (LLMs). This paper proposes a lightweight yet practical defense, called SELFDEFEND, which can defend against all existing jailbreak attacks with minimal delay for jailbreak prompts and negligible delay for normal user prompts.