Recent work has shown it is possible to construct adversarial examples that
cause an aligned language model to emit harmful strings or perform harmful
behavior. Existing attacks work either in the white-box setting (with full
access to the model weights), or through transferability: the phenomenon that
adversarial examples crafted on one model often remain effective on other
models. We improve on prior work with a query-based attack that leverages API
access to a remote language model to construct adversarial examples that cause
the model to emit harmful strings with (much) higher probability than with
transfer-only attacks. We validate our attack on GPT-3.5 and OpenAI's safety
classifier; we can cause GPT-3.5 to emit harmful strings that current transfer
attacks fail to elicit, and we can evade the safety classifier with nearly 100%
probability.

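We improve on prior work by using API access to a remote language model to construct adversarial examples that cause the model to emit harmful strings with higher probability, and we validate the attack on GPT-3.5 and OpenAI's safety classifier.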

Query-Based Adversarial Prompt Generation

Large Language Models' safety remains a critical concern due to their
vulnerability to adversarial attacks, which can prompt these systems to produce
harmful responses. At the heart of these systems lies a safety classifier, a
computational model trained to discern and mitigate potentially harmful,
offensive, or unethical outputs. However, contemporary safety classifiers,
despite their potential, often fail when exposed to inputs infused with
adversarial noise. In response, our study introduces the Adversarial Prompt
Shield (APS), a lightweight model that excels in detection accuracy and
demonstrates resilience against adversarial prompts. Additionally, we propose
novel strategies for autonomously generating adversarial training datasets,
named Bot Adversarial Noisy Dialogue (BAND) datasets. These datasets are
designed to fortify the safety classifier's robustness, and we investigate the
consequences of incorporating adversarial examples into the training process.
Through evaluations involving Large Language Models, we demonstrate that our
classifier has the potential to decrease the attack success rate resulting from
adversarial attacks by up to 60%. This advancement paves the way for the next
generation of more reliable and resilient conversational agents.
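
The abstract reports a reduction in attack success rate of "up to 60%" without saying whether the figure is absolute (percentage points) or relative to the undefended baseline. The sketch below shows how both readings are computed. The function names and the outcome lists are hypothetical, invented for illustration, and are not results from the paper.

```python
def attack_success_rate(outcomes):
    """Fraction of attack attempts that succeeded (outcomes are booleans)."""
    if not outcomes:
        raise ValueError("need at least one attack attempt")
    return sum(outcomes) / len(outcomes)


def asr_reduction(baseline, defended):
    """Return (absolute, relative) reduction in attack success rate."""
    absolute = baseline - defended
    relative = absolute / baseline if baseline > 0 else 0.0
    return absolute, relative


# Hypothetical outcomes, made up for this example only.
baseline_outcomes = [True] * 8 + [False] * 2   # 8 of 10 attacks succeed
defended_outcomes = [True] * 2 + [False] * 8   # 2 of 10 attacks succeed

baseline_asr = attack_success_rate(baseline_outcomes)
defended_asr = attack_success_rate(defended_outcomes)
absolute, relative = asr_reduction(baseline_asr, defended_asr)

print(f"baseline ASR: {baseline_asr:.2f}, defended ASR: {defended_asr:.2f}")
print(f"absolute reduction: {absolute:.2f}, relative reduction: {relative:.2f}")
```

With these made-up numbers the same experiment yields a 60-point absolute reduction and a 75% relative reduction, which is why the convention should be stated alongside any such figure.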

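The safety of large language models is an important concern. This study proposes the Adversarial Prompt Shield (APS), a lightweight model that effectively detects and withstands adversarial prompts. We also introduce new strategies for automatically generating adversarial training datasets, named Bot Adversarial Noisy Dialogue (BAND) datasets, to improve the robustness of the safety classifier. In our evaluations, the classifier reduces the attack success rate by up to 60%, paving the way for the next generation of more reliable and resilient conversational agents.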

Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield

Dialogue safety problems severely limit the real-world deployment of neural
conversational models and have attracted considerable research interest recently.
However, dialogue safety problems remain under-defined and the corresponding
datasets are scarce. We propose a taxonomy for dialogue safety specifically
designed to capture unsafe behaviors in human-bot dialogue settings, with a
focus on context-sensitive unsafety, which is under-explored in prior work.
To spur research in this direction, we compile DiaSafety, a dataset with rich
context-sensitive unsafe examples. Experiments show that existing safety
guarding tools fail severely on our dataset. As a remedy, we train a dialogue
safety classifier to provide a strong baseline for context-sensitive dialogue
unsafety detection. With our classifier, we perform safety evaluations on
popular conversational models and show that existing dialogue systems still
exhibit concerning context-sensitive safety problems.

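We propose a taxonomy for dialogue safety designed to capture unsafe behaviors in human-bot dialogue settings, focusing on context-sensitive unsafety, which prior work has under-explored, and we compile DiaSafety, a dataset rich in context-sensitive unsafe examples. Experiments show that existing safety guarding tools fail severely on it. In response, we train a dialogue safety classifier as a strong baseline for context-sensitive dialogue unsafety detection, use it to run safety evaluations on popular conversational models, and show that existing dialogue systems still exhibit concerning context-sensitive safety problems.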