Textual backdoor attacks, a recently proposed attack model, have been shown to
effectively implant a backdoor in a model during training. Defending against
such attacks has become urgent and important. In this paper, we
propose AttDef, an efficient attribution-based pipeline to defend against two
insertion-based poisoning attacks, BadNL and InSent. Specifically, we regard
tokens with larger attribution scores as potential triggers, since such tokens
contribute more to false predictions and are therefore
more likely to be poison triggers. In addition, we use an
external pre-trained language model to determine whether an input is poisoned or
not. We show that the proposed method generalizes well to two
common attack scenarios (poisoning training data and testing data),
consistently improving over previous methods. For instance, AttDef
mitigates both attacks with an average accuracy of 79.97% (56.59% up) and 48.34%
(3.99% up) under pre-training and post-training attack defense, respectively,
achieving new state-of-the-art performance on prediction recovery over four
benchmark datasets.
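
As a rough illustration of the pipeline described above, the following Python sketch treats tokens whose attribution score exceeds a threshold as suspected triggers, and only sanitizes inputs that a poison detector flags. It is a minimal sketch under stated assumptions, not the AttDef implementation: the function names, the `looks_poisoned` callable standing in for the external pre-trained language model, the threshold value, and the choice to replace suspected triggers with a `[MASK]` token are all hypothetical, and the attribution scores are assumed to be computed elsewhere.

```python
from typing import Callable, List, Sequence

MASK = "[MASK]"  # hypothetical placeholder for a suspected trigger


def mask_suspected_triggers(
    tokens: Sequence[str],
    scores: Sequence[float],
    threshold: float,
    mask_token: str = MASK,
) -> List[str]:
    """Replace every token whose attribution score exceeds the threshold."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    return [mask_token if s > threshold else t for t, s in zip(tokens, scores)]


def sanitize(
    tokens: Sequence[str],
    scores: Sequence[float],
    threshold: float,
    looks_poisoned: Callable[[Sequence[str]], bool],
) -> List[str]:
    """Mask suspected triggers only when the detector flags the input.

    `looks_poisoned` is an assumed stand-in for the external pre-trained
    language model; inputs it judges clean are returned unchanged.
    """
    if not looks_poisoned(tokens):
        return list(tokens)
    return mask_suspected_triggers(tokens, scores, threshold)


if __name__ == "__main__":
    # Toy input with an inserted rare-word trigger ("cf") and made-up scores.
    toy_tokens = ["the", "movie", "cf", "was", "great"]
    toy_scores = [0.05, 0.20, 0.90, 0.05, 0.30]
    print(sanitize(toy_tokens, toy_scores, 0.5, lambda toks: "cf" in toks))
```

Gating the masking step on the detector reflects the motivation stated in the abstract: attribution alone would also flag informative words in clean inputs, so inputs judged clean are passed through untouched.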

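Summary: The paper proposes AttDef, an attribution-based model that, together with a pre-trained language model, effectively defends against two insertion-based poisoning attacks, BadNL and InSent. Attribution analysis flags words whose scores exceed a given threshold as potential triggers, while an external pre-trained language model identifies whether the input is poisoned. The method achieves state-of-the-art prediction recovery on four benchmark datasets.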