Safety classifiers are critical in mitigating toxicity on online forums such
as social media and in chatbots. Still, they continue to be vulnerable to
emergent, and often innumerable, adversarial attacks. Traditional automated
adversarial data generation methods, however, tend to produce attacks that are
not diverse but merely variations of previously observed harm types. We formalize
the task of automated adversarial discovery for safety classifiers: finding new
attacks along previously unseen harm dimensions that expose new weaknesses in
the classifier. We measure progress on this task along two key axes: (1)
adversarial success (does the attack fool the classifier?) and (2) dimensional
diversity (does the attack represent a previously unseen harm type?). Our
evaluation of existing attack generation methods on the CivilComments toxicity
task reveals their limitations: word perturbation attacks fail to fool
classifiers, while prompt-based LLM attacks achieve more adversarial success but
lack dimensional diversity. Even our best-performing prompt-based method finds
new successful attacks on unseen harm dimensions only 5% of the time.
Automatically finding new harmful dimensions of attack is crucial, and
there is substantial headroom for future research on our new task.
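
The two evaluation axes above can be combined into a single rate: the share of
generated attacks that both fool the classifier and fall on an unseen harm
dimension. The sketch below is illustrative only and is not the paper's
evaluation code. The `Attack` record, its field names, and the reading of
"fooling the classifier" as "toxic text that the classifier fails to flag" are
all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Attack:
    # All fields are hypothetical names chosen for this sketch.
    text: str
    is_toxic: bool            # ground-truth label, e.g. from human raters
    classifier_flagged: bool  # did the safety classifier flag the text?
    harm_dimension: str       # harm-type label assigned to the attack


def adversarial_success(attack):
    # Assumption: an attack succeeds when toxic text goes unflagged.
    return attack.is_toxic and not attack.classifier_flagged


def dimensionally_diverse(attack, seen_dimensions):
    # The attack is diverse if its harm type was not previously observed.
    return attack.harm_dimension not in seen_dimensions


def discovery_rate(attacks, seen_dimensions):
    # Fraction of attacks that satisfy both axes at once.
    if not attacks:
        return 0.0
    hits = sum(
        adversarial_success(a) and dimensionally_diverse(a, seen_dimensions)
        for a in attacks
    )
    return hits / len(attacks)
```

Requiring both conditions at once matches the abstract's framing of "new
successful attacks on unseen harm dimensions"; reporting the two axes
separately would hide attacks that satisfy only one of them.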

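Safety classifiers are key to reducing toxicity on online forums such as social media and in chatbots, yet they remain vulnerable to emergent and numerous adversarial attacks. The paper formalizes automated adversarial discovery for safety classifiers: finding new attacks along previously unseen harm dimensions in order to expose new weaknesses in the classifier. Progress on the task is measured by two main indicators: (1) adversarial success (does the attack fool the classifier?) and (2) dimensional diversity (does the attack represent a previously unseen harm type?). An evaluation of existing attack generation methods on the CivilComments toxicity task shows their limitations: word perturbation attacks fail to fool classifiers, while prompt-based LLM attacks have higher adversarial success but lack dimensional diversity. Even the most effective prompt-based method succeeds on previously unseen harm dimensions only 5% of the time. Automatically discovering new harmful dimensions of attack is crucial, and there is substantial room for future research on this new task.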

Automated Adversarial Discovery for Safety Classifiers

As large language models (LLMs) are widely adopted, new safety issues and
policies emerge, to which existing safety classifiers do not generalize well.
If we have only observed a few examples of violations of a new safety rule, how
can we build a classifier to detect violations? In this paper, we study the
novel setting of domain-generalized few-shot learning for LLM-based text safety
classifiers. Unlike in prior few-shot work, these new safety issues can be hard to
uncover, and we do not get to choose the few examples. We demonstrate that
existing few-shot techniques do not perform well in this setting, and instead
propose parameter-efficient fine-tuning (PEFT) combined with augmenting
training data based on similar examples from existing rules. We empirically
show that our approach of similarity-based data augmentation + prompt-tuning
(DAPT) consistently outperforms baselines that lack either data
augmentation or PEFT by 7-17% F1 score on the Social Chemistry moral
judgement task and 9-13% AUC on the toxicity detection task, even when the new
rule is only loosely correlated with existing ones.
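
The similarity-based augmentation step can be pictured as retrieval: for the
few examples of a new rule, pull the most similar examples from data collected
under existing rules and add them to the training set. The sketch below is a
minimal illustration under stated assumptions, not the authors' DAPT
implementation. It uses bag-of-words cosine similarity as a stand-in for
whatever similarity measure the paper uses, keeps each retrieved example's
original label (the abstract does not say how labels are handled), and all
function names are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    # Stand-in representation; a real system would likely use learned embeddings.
    return Counter(text.lower().split())


def cosine(a, b):
    # Cosine similarity between two token-count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def augment(few_shot, pool, k=2):
    """Return few_shot plus the k pool examples closest to any few-shot example.

    few_shot and pool are lists of (text, label) pairs. Retrieved examples
    keep their original labels, which is an assumption of this sketch.
    """
    shot_vectors = [bag_of_words(text) for text, _ in few_shot]

    def score(item):
        vector = bag_of_words(item[0])
        return max((cosine(vector, s) for s in shot_vectors), default=0.0)

    ranked = sorted(pool, key=score, reverse=True)
    return list(few_shot) + ranked[:k]
```

The augmented set would then feed a parameter-efficient method such as
prompt-tuning; that training step is omitted here because the abstract gives no
details about it.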

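A domain-generalized few-shot learning approach that combines tuning with data augmentation improves F1 score by 7-17% on the Social Chemistry moral judgement task and AUC by 9-13% on the toxicity detection task compared with traditional methods.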