BriefGPT.xyz
Jun, 2024
Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs
Fan Liu, Zhao Xu, Hao Liu
TL;DR
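We propose a two-stage adversarial tuning framework that optimizes over datasets of adversarial prompts paired with safe responses in order to improve how well the defenses of large language models generalize. Experiments demonstrate the superiority of our method and show its potential as a transferable defense mechanism.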
Abstract
Although safely enhanced large language models (LLMs) have achieved remarkable success in tackling various complex tasks in a zero-shot manner, they remain susceptible to jailbreak attacks, particularly the unkno