Safety classifiers are critical in mitigating toxicity on online forums such
as social media and in chatbots. Still, they continue to be vulnerable to
emergent, and often innumerable, adversarial attacks. Traditional automated
adversarial data generation methods, however, tend to produce attacks that lack
diversity and are merely variations of previously observed harm types. We
formalize the task of automated adversarial discovery for safety classifiers:
finding new attacks along previously unseen harm dimensions that expose new
weaknesses in the classifier. We measure progress on this task along two key
axes: (1)
adversarial success: does the attack fool the classifier? and (2) dimensional
diversity: does the attack represent a previously unseen harm type? Our
evaluation of existing attack generation methods on the CivilComments toxicity
task reveals their limitations: word perturbation attacks fail to fool
classifiers, while prompt-based LLM attacks have more adversarial success, but
lack dimensional diversity. Even our best-performing prompt-based method finds
new successful attacks on unseen harm dimensions only 5\% of the time.
Automatically finding new harmful dimensions of attack is crucial, and
there is substantial headroom for future research on our new task.
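
As a rough illustration of the two evaluation axes, the sketch below scores a
batch of candidate attacks. It is a minimal sketch under stated assumptions, not
the paper's evaluation code: the `Attack` record, the `harm_dimension` labels,
the `seen_dimensions` set, and the `evaluate` function are all hypothetical
names, every candidate is assumed to be genuinely toxic, and the "joint" rate
(fooling the classifier while also belonging to an unseen harm dimension) is one
plausible way to combine the two axes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attack:
    """Hypothetical record for one generated attack (assumed truly toxic)."""
    text: str
    predicted_toxic: bool  # what the safety classifier said about the text
    harm_dimension: str    # assumed harm-type label, e.g. from annotation


def evaluate(attacks, seen_dimensions):
    """Score attacks along the two axes, plus their conjunction.

    adversarial_success:   fraction the classifier failed to flag as toxic.
    dimensional_diversity: fraction whose harm dimension is not in seen_dimensions.
    joint:                 fraction satisfying both conditions at once.
    """
    n = len(attacks)
    if n == 0:
        return {"adversarial_success": 0.0,
                "dimensional_diversity": 0.0,
                "joint": 0.0}
    fooled = [not a.predicted_toxic for a in attacks]
    novel = [a.harm_dimension not in seen_dimensions for a in attacks]
    return {
        "adversarial_success": sum(fooled) / n,
        "dimensional_diversity": sum(novel) / n,
        "joint": sum(f and v for f, v in zip(fooled, novel)) / n,
    }


if __name__ == "__main__":
    # Toy inputs; the dimension names here are placeholders, not from the paper.
    candidates = [
        Attack("example a", predicted_toxic=False, harm_dimension="sarcasm"),
        Attack("example b", predicted_toxic=True, harm_dimension="sarcasm"),
        Attack("example c", predicted_toxic=False, harm_dimension="insult"),
        Attack("example d", predicted_toxic=True, harm_dimension="threat"),
    ]
    print(evaluate(candidates, seen_dimensions={"insult", "threat"}))
```

Reporting the joint rate separately matters because a method can score well on
either axis alone while rarely achieving both at once.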
