Recent work has developed optimization procedures to find token sequences,
called adversarial triggers, which can elicit unsafe responses from aligned
language models. These triggers are believed to be universally transferable,
i.e., a trigger optimized on one model can jailbreak other models. In this
paper, we concretely show that such adversarial triggers are not universal. We
extensively investigate trigger transfer amongst 13 open models and observe
inconsistent transfer. Our experiments further reveal a significant difference
in robustness to adversarial triggers between models Aligned by Preference
Optimization (APO) and models Aligned by Fine-Tuning (AFT). We find that APO
models are extremely hard to jailbreak even when the trigger is optimized
directly on the model. On the other hand, while AFT models may appear safe on
the surface, exhibiting refusals to a range of unsafe instructions, we show
that they are highly susceptible to adversarial triggers. Lastly, we observe
that most triggers optimized on AFT models also generalize to new unsafe
instructions from five diverse domains, further emphasizing their
vulnerability. Overall, our work highlights the need for more comprehensive
safety evaluations for aligned language models.

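Summary: We study adversarial triggers, i.e., token sequences found through optimization that elicit unsafe responses, examining their transferability, model robustness to them, and the effect of the alignment method. Models Aligned by Preference Optimization (APO) are extremely hard to jailbreak, whereas models Aligned by Fine-Tuning (AFT) are highly susceptible to adversarial triggers. Most triggers optimized on AFT models also generalize to new unsafe instructions from five diverse domains, underscoring their vulnerability. Our work therefore highlights the need for more comprehensive safety evaluations of aligned language models.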

Universal Adversarial Triggers Are Not Universal

The prompt-based learning paradigm bridges the gap between pre-training and
fine-tuning, and works effectively under the few-shot setting. However, we find
that this learning paradigm inherits the vulnerability from the pre-training
stage, where model predictions can be misled by inserting certain triggers into
the text. In this paper, we explore this universal vulnerability by either
injecting backdoor triggers or searching for adversarial triggers on
pre-trained language models using only plain text. In both scenarios, we
demonstrate that our triggers can totally control or severely decrease the
performance of prompt-based models fine-tuned on arbitrary downstream tasks,
reflecting the universal vulnerability of the prompt-based learning paradigm.
Further experiments show that adversarial triggers have good transferability
among language models. We also find that conventionally fine-tuned models are
not vulnerable to adversarial triggers constructed from pre-trained language
models. We conclude by proposing a potential solution to mitigate our attack
methods. Code and data are publicly available at
this https URL

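Summary: This paper studies the universal vulnerability of the prompt-based learning paradigm. It finds that inserting specific triggers can fully control or severely degrade model performance, and it proposes a potential solution to mitigate the attacks.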

Exploring the Universal Vulnerability of Prompt-based Learning Paradigm

We present a general approach towards controllable societal biases in natural
language generation (NLG). Building upon the idea of adversarial triggers, we
develop a method to induce societal biases in generated text when input prompts
contain mentions of specific demographic groups. We then analyze two scenarios:
1) inducing negative biases for one demographic and positive biases for another
demographic, and 2) equalizing biases between demographics. The former scenario
enables us to detect the types of biases present in the model. Specifically, we
show the effectiveness of our approach at facilitating bias analysis by finding
topics that correspond to demographic inequalities in generated text and
comparing the relative effectiveness of inducing biases for different
demographics. The second scenario is useful for mitigating biases in downstream
applications such as dialogue generation. In our experiments, the mitigation
technique proves effective at equalizing the amount of bias across
demographics while simultaneously generating less negatively biased text
overall.

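Summary: We propose a general approach to controlling societal biases in natural language generation. We develop a method that induces societal biases when input prompts mention specific demographic groups, and we analyze two scenarios: inducing negative biases for one demographic and positive biases for another, and equalizing biases across demographics. The approach proves effective for bias mitigation.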