Recent work has shown it is possible to construct adversarial examples that
cause an aligned language model to emit harmful strings or perform harmful
behavior. Existing attacks work either in the white-box setting (with full
access to the model weights), or through transferability: the phenomenon that
adversarial examples crafted on one model often remain effective on other
models. We improve on prior work with a query-based attack that leverages API
access to a remote language model to construct adversarial examples that cause
the model to emit harmful strings with (much) higher probability than with
transfer-only attacks. We validate our attack on GPT-3.5 and OpenAI's safety
classifier; we can cause GPT-3.5 to emit harmful strings that current transfer
attacks fail at, and we can evade the safety classifier with nearly 100%
probability.

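We improve on prior work by using API access to a remote language model to construct adversarial examples that cause it to emit harmful strings with higher probability, and we validate the attack on GPT-3.5 and OpenAI's safety classifier.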

Query-Based Adversarial Prompt Generation

In the black-box adversarial attack setting, the target model's
parameters are unknown, and the attacker aims to find a successful adversarial
perturbation based on query feedback under a query budget. Due to the limited
feedback information, existing query-based black-box attack methods often
require many queries for attacking each benign example. To reduce query cost,
we propose to utilize the feedback information across historical attacks,
dubbed example-level adversarial transferability. Specifically, by treating the
attack on each benign example as one task, we develop a meta-learning framework
by training a meta-generator to produce perturbations conditioned on benign
examples. When attacking a new benign example, the meta-generator can be
quickly fine-tuned based on the feedback information of the new task as well as
a few historical attacks to produce effective perturbations. Moreover, since
the meta-training procedure consumes many queries to learn a generalizable
generator, we utilize model-level adversarial transferability to train the
meta-generator on a white-box surrogate model, then transfer it to help the
attack against the target model. The proposed framework with the two types of
adversarial transferability can be naturally combined with any off-the-shelf
query-based attack method to boost its performance, which is verified by
extensive experiments.
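
To make the query-budget setting concrete, the following is a minimal sketch, not the paper's meta-generator. It assumes a toy linear score function as the black-box target, a plain score-based random search as the attack, and a fixed sign-step perturbation taken from a surrogate model as a stand-in for a transferred prior. All names (`BudgetedOracle`, `surrogate_prior`, `random_search`, the weight vectors) are hypothetical and chosen only for illustration.

```python
import random


class QueryBudgetExceeded(Exception):
    """Raised when the attacker asks for more queries than the budget allows."""


class BudgetedOracle:
    """Wraps a black-box score function and counts every query made to it."""

    def __init__(self, score_fn, budget):
        self.score_fn = score_fn
        self.budget = budget
        self.queries = 0

    def __call__(self, x):
        if self.queries >= self.budget:
            raise QueryBudgetExceeded
        self.queries += 1
        return self.score_fn(x)


def linear_score(weights, x):
    # Toy target (assumption): the attack succeeds once the score drops below 0.
    return sum(w * xi for w, xi in zip(weights, x))


def surrogate_prior(surrogate_weights, eps):
    # Sign step against the surrogate's gradient; stands in for a transferred prior.
    return [-eps if w > 0 else eps for w in surrogate_weights]


def random_search(oracle, x, init_delta, eps, rng):
    """Score-based random search over perturbations with entries in {-eps, 0, +eps}."""
    delta = list(init_delta)
    best = oracle([xi + di for xi, di in zip(x, delta)])
    while best >= 0:
        cand = list(delta)
        cand[rng.randrange(len(x))] = rng.choice([-eps, eps])
        try:
            score = oracle([xi + di for xi, di in zip(x, cand)])
        except QueryBudgetExceeded:
            break
        if score < best:  # keep the candidate only if the feedback improves
            best, delta = score, cand
    return delta, best


TARGET_W = [1.0, -2.0, 0.5, 1.5]      # hidden from the attacker
SURROGATE_W = [0.8, -1.5, 0.7, 1.0]   # white-box surrogate with similar weights
X = [1.0, 1.0, 1.0, 1.0]
EPS = 0.5


def run(init_delta, seed=0, budget=50):
    oracle = BudgetedOracle(lambda v: linear_score(TARGET_W, v), budget)
    _, best = random_search(oracle, X, init_delta, EPS, random.Random(seed))
    return best, oracle.queries
```

Calling `run(surrogate_prior(SURROGATE_W, EPS))` starts from the surrogate-derived perturbation, while `run([0.0] * 4)` starts from no perturbation; comparing the returned query counts illustrates why a good prior can reduce query cost in this toy setting.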

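The paper proposes reusing feedback from historical attacks to reduce the query cost of black-box adversarial attacks. It develops a meta-learning framework that trains a meta-generator to produce effective perturbations, and it uses model-level adversarial transferability to train the meta-generator on a surrogate model so that it can help attack the target model. The framework can be combined with any off-the-shelf query-based attack method to improve attack performance.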

Generalizable Black-Box Adversarial Attack with Meta Learning

We study the query-based attack against image retrieval to evaluate its
robustness against adversarial examples under the black-box setting, where the
adversary only has query access to the top-k ranked unlabeled images from the
database. Compared with query attacks on image classification, which craft
adversarial examples from the returned labels or confidence scores, the challenge
is greater here because attack effectiveness is hard to quantify on a
partial retrieved list. In this paper, we make the first
attempt at a Query-based Attack against Image Retrieval (QAIR) that aims to completely
subvert the top-k retrieval results. Specifically, a new relevance-based loss
is designed to quantify the attack effects by measuring the set similarity of
the top-k retrieval results before and after the attack and to guide the gradient
optimization. To further boost the attack efficiency, a recursive model
stealing method is proposed to acquire transferable priors on the target model
and generate the prior-guided gradients. Comprehensive experiments show that
the proposed attack achieves a high attack success rate with few queries
against image retrieval systems under the black-box setting. Evaluations
on a real-world visual search engine show that the attack successfully
deceives a commercial system such as Bing Visual Search with a 98% attack success
rate using only 33 queries on average.
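
As a rough illustration of measuring set similarity between top-k results before and after an attack, the sketch below uses plain Jaccard overlap. This is an assumption made for illustration and is not necessarily the relevance-based loss defined in the paper; the function name `topk_overlap` is hypothetical.

```python
def topk_overlap(before, after):
    """Jaccard similarity between two top-k result lists (rank order ignored).

    1.0 means the attack left the top-k set unchanged;
    0.0 means the top-k results were completely replaced.
    """
    a, b = set(before), set(after)
    if not a and not b:
        return 1.0  # two empty lists are treated as identical
    return len(a & b) / len(a | b)


# Hypothetical image IDs returned by a retrieval system.
clean_results = ["img_a", "img_b", "img_c"]
attacked_results = ["img_c", "img_d", "img_e"]
score = topk_overlap(clean_results, attacked_results)  # 1 shared / 5 distinct
```

A score like this needs only the returned top-k list, which matches the query-only access described above; an attacker would try to drive it toward 0.0.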

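This study proposes QAIR, the first query-based attack against image retrieval in the black-box setting. It quantifies attack effects on the partial retrieved list with a new relevance-based loss, adds a recursive model stealing method, and successfully deceives a commercial system such as Bing Visual Search using only a small number of queries.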