Recent NLP literature pays little attention to the robustness of toxic
language predictors, even though these systems are most likely to be used in
adversarial contexts. This paper presents a novel adversarial attack,
\texttt{ToxicTrap}, which introduces small word-level perturbations to fool
state-of-the-art (SOTA) text classifiers into predicting toxic text samples as
benign. ToxicTrap exploits greedy-based search strategies to enable fast and
effective generation of toxic adversarial examples. Two novel goal function
designs allow ToxicTrap to
identify weaknesses in both multiclass and multilabel toxic language detectors.
Our empirical results show that SOTA toxicity text classifiers are indeed
vulnerable to the proposed attacks, which attain over 98\% attack success rates
in multilabel cases. We also show how vanilla adversarial training and an
improved version of it can increase the robustness of a toxicity detector, even
against unseen attacks.
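
To make the idea of a greedy word-level attack with a multilabel goal function concrete, the following is a minimal sketch and not the ToxicTrap implementation. Everything in it is an assumption made for illustration: the toy scorer `toy_scores`, its lexicon, the 0.5 threshold, the candidate substitutions, and the choice of the summed label scores as the greedy objective. The assumed goal is that every toxic label falls below the threshold.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical stand-in for a multilabel toxicity detector: one score per
# toxic label. A real attack would query a trained classifier instead.
LEXICON = {"idiot": {"insult": 0.9}, "hate": {"toxic": 0.8}}


def toy_scores(tokens: List[str]) -> Dict[str, float]:
    scores = {"toxic": 0.0, "insult": 0.0}
    for tok in tokens:
        for label, s in LEXICON.get(tok, {}).items():
            scores[label] = max(scores[label], s)
    return scores


def goal_reached(scores: Dict[str, float], threshold: float = 0.5) -> bool:
    # Assumed multilabel goal: all toxic labels are predicted as "off".
    return all(s < threshold for s in scores.values())


def greedy_attack(
    tokens: List[str],
    candidates: Dict[str, List[str]],
    score_fn: Callable[[List[str]], Dict[str, float]],
    threshold: float = 0.5,
) -> Optional[List[str]]:
    """Greedily apply the single substitution that lowers the summed
    toxicity scores the most, until the goal is reached or no substitution
    helps. Returns the perturbed tokens, or None if the attack fails."""
    current = list(tokens)
    remaining = set(range(len(current)))  # each position is changed at most once
    while not goal_reached(score_fn(current), threshold):
        base = sum(score_fn(current).values())
        best = None  # (summed score, position, substitute)
        for i in sorted(remaining):
            for sub in candidates.get(current[i], []):
                trial = current[:i] + [sub] + current[i + 1:]
                total = sum(score_fn(trial).values())
                if total < base and (best is None or total < best[0]):
                    best = (total, i, sub)
        if best is None:
            return None  # no remaining substitution reduces the scores
        _, pos, sub = best
        current[pos] = sub
        remaining.discard(pos)
    return current


if __name__ == "__main__":
    # Hypothetical candidate substitutions for two words.
    subs = {"idiot": ["id1ot"], "hate": ["h8"]}
    print(greedy_attack(["you", "idiot", "i", "hate", "you"], subs, toy_scores))
```

A multiclass variant under the same assumptions would only change `goal_reached`, for example to check that the predicted class is the benign one.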
