This paper improves the robustness of the pretrained language model BERT against word substitution-based adversarial attacks by leveraging self-supervised contrastive learning with adversarial perturbations. One advantage of our method over previous work is that it improves model robustness without using any labels. We also create an adversarial attack for word-level adversarial training on BERT. The attack is efficient enough that BERT can be adversarially trained on examples generated on the fly during training. Experimental results on four datasets show that our method improves the robustness of BERT against four different word substitution-based adversarial attacks. Furthermore, to understand why our method improves model robustness against adversarial attacks, we study the vector representations of clean examples and their corresponding adversarial examples before and after applying our method. Because our method improves model robustness using unlabeled raw data, it opens up the possibility of using large text datasets to train robust language models.
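To make the label-free objective concrete, the following is a minimal sketch of a contrastive loss in which each clean example's adversarial counterpart serves as its positive and the other adversarial examples in the batch serve as negatives. It is an illustrative InfoNCE-style loss, not necessarily the exact objective used in this paper; the function name `contrastive_loss`, the `temperature` value, and the use of plain Python lists in place of BERT sentence representations are all assumptions made for the example.

```python
import math


def _normalize(vector):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def contrastive_loss(clean, adversarial, temperature=0.1):
    """InfoNCE-style loss: adversarial[i] is the positive for clean[i].

    Hypothetical helper for illustration only. `clean` and `adversarial`
    are equally sized lists of non-zero embedding vectors; no labels are used.
    """
    z_clean = [_normalize(v) for v in clean]
    z_adv = [_normalize(v) for v in adversarial]
    total = 0.0
    for i, anchor in enumerate(z_clean):
        # Temperature-scaled cosine similarity to every adversarial embedding.
        logits = [
            sum(a * b for a, b in zip(anchor, other)) / temperature
            for other in z_adv
        ]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of picking the matching adversarial example.
        total += log_denominator - logits[i]
    return total / len(z_clean)


clean_batch = [[1.0, 0.0], [0.0, 1.0]]
adversarial_batch = [[0.9, 0.1], [0.1, 0.9]]
loss_matched = contrastive_loss(clean_batch, adversarial_batch)
loss_mismatched = contrastive_loss(clean_batch, adversarial_batch[::-1])
```

Under these assumptions, the loss is small when each adversarial example stays close to its own clean example in representation space and large when it drifts toward another example, which is the behavior a robustness-oriented contrastive objective is meant to reward.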

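This paper presents a method for improving the robustness of the BERT language model against word substitution-based adversarial attacks. The method uses contrastive learning with adversarial perturbations and creates a word-level adversarial attack that generates hard positive examples. Experimental results show that our method improves the robustness of BERT against four different word substitution-based adversarial attacks, and that combining it with adversarial training yields higher robustness than adversarial training alone. Because our method improves the robustness of BERT using only unlabeled data, large text datasets can be used to train language models that are robust to word substitution-based adversarial attacks.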

Self-supervised contrastive learning with adversarial perturbations for defending against word substitution-based attacks