In recent years, Text-to-Image (T2I) models have seen remarkable
advancements, gaining widespread adoption. However, this progress has
inadvertently opened avenues for potential misuse, particularly in generating
inappropriate or Not-Safe-For-Work (NSFW) content. Our work introduces
MMA-Diffusion, a framework that presents a significant and realistic threat to
the security of T2I models by effectively circumventing current defensive
measures in both open-source models and commercial online services. Unlike
previous approaches, MMA-Diffusion leverages both textual and visual modalities
to bypass safeguards such as prompt filters and post-hoc safety checkers, thus
exposing the vulnerabilities in existing defense mechanisms.

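In recent years, text-to-image (T2I) models have advanced markedly and been widely adopted, but this progress has inadvertently opened avenues for misuse, especially the generation of inappropriate or unsafe content. This work introduces MMA-Diffusion, a framework that poses a serious and realistic threat to the security of T2I models by effectively circumventing the current defenses of both open-source models and commercial online services. Unlike previous approaches, MMA-Diffusion uses both the textual and visual modalities to bypass safeguards such as prompt filters and post-hoc safety checkers, thereby revealing weaknesses in existing defense mechanisms.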

MMA-Diffusion: MultiModal Attack on Diffusion Models

Text-to-image generative models such as Stable Diffusion and DALL$\cdot$E 2
have attracted much attention since their publication due to their wide
application in the real world. One challenging problem of text-to-image
generative models is the generation of Not-Safe-for-Work (NSFW) content, e.g.,
content related to violence or adult material. Therefore, a common practice is to deploy
a so-called safety filter, which blocks NSFW content based on either text or
image features. Prior works have studied how such safety filters can be
bypassed. However, existing approaches are largely manual and specific to Stable
Diffusion's official safety filter. Moreover, in our evaluation they bypass
Stable Diffusion's safety filter at a rate as low as 23.51%.
In this paper, we propose the first automated attack framework, called
SneakyPrompt, to evaluate the robustness of real-world safety filters in
state-of-the-art text-to-image generative models. Our key insight is to search
for alternative tokens in a prompt that generates NSFW images so that the
generated prompt (called an adversarial prompt) bypasses existing safety
filters. Specifically, SneakyPrompt utilizes reinforcement learning (RL) to
guide an agent with positive rewards on semantic similarity and bypass success.
Our evaluation shows that SneakyPrompt successfully generates NSFW content
using the online model DALL$\cdot$E 2 with its default, closed-box safety filter
enabled. We also deploy several open-source state-of-the-art safety filters on
a Stable Diffusion model and show that SneakyPrompt not only successfully
generates NSFW content but also outperforms existing adversarial attacks in
terms of the number of queries and image quality.

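This study proposes SneakyPrompt, an automated attack framework that uses reinforcement learning to generate unsafe content that bypasses the safety filters of existing text-to-image generative models. Experiments show that SneakyPrompt not only generates NSFW content successfully but also outperforms existing adversarial attacks in the number of queries and image quality.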