Although Large Language Models (LLMs) have achieved tremendous success in
various applications, they are also susceptible to certain prompts that can
induce them to bypass built-in safety measures and provide dangerous or illegal
content, a phenomenon known as jailbreak. To protect LLMs from producing
harmful information, various defense strategies have been proposed, most of
which focus on content filtering or adversarial training of models. In this paper,
we propose an approach named Prompt Adversarial Tuning (PAT) that trains a
defense control, which is then prepended to user prompts as a prefix to
implement our defense strategy. To optimize this control, we design a training
process similar to adversarial training that alternates between updating the
attack and defense controls. To our knowledge, we are the first to approach
defense from the perspective of prompt tuning. Once deployed, our method has
little impact on the operational efficiency of LLMs. Experiments show
that our method is effective in both black-box and white-box settings, reducing
the success rate of advanced attacks to nearly 0 while maintaining a benign
answer rate of 80% on simple benign questions. Our work may offer a new
perspective for future explorations in LLM security.
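
The alternating scheme described above can be illustrated with a toy sketch. This is not the paper's implementation: the vocabulary, the `harm_score` function (a hand-written stand-in for a model-based loss), the random single-token search, and all function names are assumptions made purely for illustration. It shows only the general shape of the idea: an attack control and a defense control are updated in turn against each other, and the resulting defense control is prepended to the user prompt.

```python
import random

# Toy vocabulary; an assumption for illustration only.
VOCAB = ["safe", "refuse", "ignore", "rules", "please", "now"]

def harm_score(defense, attack):
    """Hypothetical stand-in for how likely a model is to comply with a
    harmful request (higher is worse). A real setup would use an LLM loss."""
    attack_push = sum(tok in ("ignore", "now") for tok in attack)
    defense_pull = sum(tok in ("safe", "refuse") for tok in defense)
    return attack_push - defense_pull

def improve(control, score_fn, rng):
    """Try one random single-token substitution and keep it only if it
    strictly increases score_fn."""
    candidate = list(control)
    candidate[rng.randrange(len(candidate))] = rng.choice(VOCAB)
    return candidate if score_fn(candidate) > score_fn(control) else control

def alternate_tuning(steps=500, length=4, seed=0):
    """Alternate between an attack update and a defense update."""
    rng = random.Random(seed)
    defense = ["please"] * length
    attack = ["please"] * length
    for _ in range(steps):
        # Attack step: raise the harm score against the current defense.
        attack = improve(attack, lambda a: harm_score(defense, a), rng)
        # Defense step: lower the harm score against the current attack.
        defense = improve(defense, lambda d: -harm_score(d, attack), rng)
    return defense, attack

def guard(defense, user_prompt):
    """Prepend the tuned defense control to the user prompt as a prefix."""
    return " ".join(defense) + " " + user_prompt

defense, attack = alternate_tuning()
print(guard(defense, "How do I bake bread?"))
```

Because the defense control is only a short prefix added at inference time, the sketch also reflects why such a defense adds little overhead: no extra model call or filter is involved, only a slightly longer prompt.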

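We propose a method named Prompt Adversarial Tuning (PAT) that trains a defense control mechanism and uses it as a prefix to user prompts to implement our defense strategy. The method is effective in both black-box and white-box settings: with almost no impact on operational efficiency, it reduces the success rate of advanced attacks to nearly 0 while still maintaining a benign answer rate of 80% on simple questions. Our work may open a new perspective for future explorations in LLM security.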