With the rise of text-to-image (T2I) generative AI models reaching wide
audiences, it is critical to evaluate model robustness against non-obvious
attacks to mitigate the generation of offensive images. By focusing on
"implicitly adversarial" prompts (those that trigger T2I models to generate
unsafe images for non-obvious reasons), we isolate a set of difficult safety
issues that human creativity is well-suited to uncover. To this end, we built
the Adversarial Nibbler Challenge, a red-teaming methodology for crowdsourcing
a diverse set of implicitly adversarial prompts. We have assembled a suite of
state-of-the-art T2I models, employed a simple user interface to identify and
annotate harms, and engaged diverse populations to capture long-tail safety
issues that may be overlooked in standard testing. The challenge is run in
consecutive rounds to enable sustained discovery and analysis of safety
pitfalls in T2I models.
In this paper, we present an in-depth account of our methodology, a
systematic study of novel attack strategies, and a discussion of safety failures
revealed by challenge participants. We also release a companion visualization
tool for easy exploration and derivation of insights from the dataset. The
first challenge round resulted in over 10k prompt-image pairs with machine
annotations for safety. A subset of 1.5k samples contains rich human
annotations of harm types and attack styles. We find that 14% of images that
humans consider harmful are mislabeled as "safe" by machines. We have
identified new attack strategies that highlight the complexity of ensuring T2I
model robustness. Our findings emphasize the necessity of continual auditing
and adaptation as new vulnerabilities emerge. We are confident that this work
will enable proactive, iterative safety assessments and promote responsible
development of T2I models.
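
To make the 14% figure concrete, the sketch below shows one way such a rate could be computed: among images that human annotators consider harmful, the fraction that an automated classifier labeled "safe". The `Annotation` record, its field names, and the toy counts are hypothetical and chosen only for illustration; they are not the dataset's actual schema or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical record for one prompt-image pair (not the real schema)."""
    human_harmful: bool  # human annotators judged the image harmful
    machine_safe: bool   # automated safety classifier labeled it "safe"


def machine_miss_rate(annotations):
    """Fraction of human-harmful images that the machine labeled safe."""
    harmful = [a for a in annotations if a.human_harmful]
    if not harmful:
        return 0.0  # no harmful images, so nothing could be missed
    missed = sum(1 for a in harmful if a.machine_safe)
    return missed / len(harmful)


# Synthetic toy data: 50 human-harmful images, 7 of which the machine missed,
# plus 50 images that humans did not consider harmful.
toy = (
    [Annotation(human_harmful=True, machine_safe=True)] * 7
    + [Annotation(human_harmful=True, machine_safe=False)] * 43
    + [Annotation(human_harmful=False, machine_safe=True)] * 50
)
rate = machine_miss_rate(toy)
print(f"{rate:.0%} of human-harmful images were labeled safe by the machine")
```

Note that the denominator is the set of images humans flagged as harmful, not the whole dataset, which matches how the finding is worded above.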

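As text-to-image (T2I) generative AI models develop, evaluating model robustness against non-obvious attacks is critical. In this paper, by focusing on "implicitly adversarial" prompts (prompts that trigger T2I models to generate unsafe images for non-obvious reasons), we isolate a set of hard-to-discover safety issues that human creativity is well suited to uncover. By building the Adversarial Nibbler Challenge, a red-teaming methodology for crowdsourcing implicitly adversarial prompts, we assemble a suite of state-of-the-art T2I models, use a simple user interface to identify and annotate harms, and engage diverse populations to capture long-tail safety issues that may be overlooked in standard testing. The challenge runs in consecutive rounds to sustain the discovery and analysis of safety pitfalls in T2I models. The paper details our methodology, a systematic study of novel attack strategies, and a discussion of the safety failures revealed by challenge participants. We also release a companion visualization tool for exploring the dataset and deriving insights from it. The first challenge round produced over 10,000 prompt-image pairs with machine annotations for safety, of which 1,500 samples carry rich human annotations of harm types and attack styles. We find that 14% of images that humans consider harmful are mislabeled as "safe" by machines. We have identified new attack strategies that highlight the complexity of ensuring T2I model robustness. Our findings emphasize the need for continual auditing and adaptation as new vulnerabilities emerge. We believe this work will enable proactive, iterative safety assessments and promote the responsible development of T2I models.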

Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation

Text-conditioned image generation models have recently achieved astonishing
image quality and alignment results. Consequently, they are employed in a
fast-growing number of applications. Since they are highly data-driven, relying
on billion-sized datasets randomly scraped from the web, they also produce
unsafe content. As a contribution to the Adversarial Nibbler challenge, we
distill a large set of over 1,000 potential adversarial inputs from existing
safety benchmarks. Our analysis of the gathered prompts and corresponding
images demonstrates the fragility of input filters and provides further
insights into systematic safety issues in current generative image models.

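Text-conditioned image generation models have achieved astonishing results in image quality and alignment. However, they rely on very large datasets randomly scraped from the web and therefore also generate unsafe content. As a contribution to the Adversarial Nibbler challenge, we distill over 1,000 potential adversarial inputs from existing safety benchmarks. Our analysis of the collected prompts and corresponding images reveals the fragility of input filters and provides further insight into systematic safety issues in current generative image models.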

Distilling Adversarial Prompts from Safety Benchmarks: Report for the Adversarial Nibbler Challenge

The generative AI revolution in recent years has been spurred by an expansion
in compute power and data quantity, which together enable extensive
pretraining of powerful text-to-image (T2I) models. With their greater
capabilities to generate realistic and creative content, T2I models such as
DALL-E, MidJourney, Imagen, and Stable Diffusion are reaching ever wider
audiences. Any unsafe behaviors inherited from pretraining on uncurated
internet-scraped datasets thus have the potential to cause wide-reaching harm,
for example, through generated images which are violent, sexually explicit, or
contain biased and derogatory stereotypes. Despite this risk of harm, we lack
systematic and structured evaluation datasets to scrutinize model behavior,
especially under adversarial attacks that bypass existing safety filters. A typical
bottleneck in safety evaluation is achieving a wide coverage of different types
of challenging examples in the evaluation set, i.e., identifying 'unknown
unknowns' or long-tail problems. To address this need, we introduce the
Adversarial Nibbler challenge. The goal of this challenge is to crowdsource a
diverse set of failure modes and reward challenge participants for successfully
finding safety vulnerabilities in current state-of-the-art T2I models.
Ultimately, we aim to provide greater awareness of these issues and assist
developers in improving the future safety and reliability of generative AI
models. Adversarial Nibbler is a data-centric challenge, part of the DataPerf
challenge suite, organized and supported by Kaggle and MLCommons.

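This study aims to address safety issues in text-to-image (T2I) models. It introduces the Adversarial Nibbler challenge, which collects and analyzes attacks on current state-of-the-art T2I models in order to raise awareness of these issues.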