Although safety-enhanced Large Language Models (LLMs) have achieved
remarkable success in tackling various complex tasks in a zero-shot manner,
they remain susceptible to jailbreak attacks, particularly previously unseen
ones. To enhance LLMs' generalized defense capabilities, we propose
a two-stage adversarial tuning framework, which generates adversarial prompts
to explore worst-case scenarios by optimizing datasets containing pairs of
adversarial prompts and their safe responses. In the first stage, we introduce
hierarchical meta-universal adversarial prompt learning to efficiently and
effectively generate token-level adversarial prompts. In the second stage, we
propose automatic adversarial prompt learning to iteratively refine
semantic-level adversarial prompts, further enhancing LLMs' defense
capabilities. We conducted comprehensive experiments on three widely used
jailbreak datasets, comparing our framework with six defense baselines under
five representative attack scenarios. The results underscore the superiority of
our proposed methods. Furthermore, our adversarial tuning framework exhibits
empirical generalizability across various attack strategies and target LLMs,
highlighting its potential as a transferable defense mechanism.

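By optimizing datasets containing pairs of adversarial prompts and their safe responses, we propose a two-stage adversarial tuning framework that enhances the generalized defense capabilities of large language models. Experiments demonstrate the superiority of our method and show its potential as a transferable defense mechanism.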