Large-scale language models have achieved state-of-the-art performance on a
number of language tasks. However, they fail on adversarial language examples,
which are sentences optimized to fool the language models while preserving a
similar semantic meaning for humans. While prior work focuses on making the language
model robust at training time, retraining for robustness is often unrealistic
for large-scale foundation models. Instead, we propose to make the language
models robust at test time. By dynamically adapting the input sentence with
predictions from masked words, we show that we can reverse many adversarial
language attacks. Since our approach does not require any training, it works
for novel tasks at test time and can adapt to novel adversarial corruptions.
Visualizations and empirical results on two popular sentence classification
datasets demonstrate that our method can repair over 65% of adversarial
language attacks.
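
The test-time repair idea described above (mask a word in the input, then replace it with a prediction made from its context) can be illustrated with a minimal sketch. This is not the authors' implementation: the context-count predictor below is a toy stand-in for a real masked language model, and the names `build_toy_predictor`, `repair_sentence`, and the rule of masking only out-of-vocabulary words are all hypothetical choices made for illustration.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence, Set, Tuple

MASK = "[MASK]"


def build_toy_predictor(corpus: Sequence[str]) -> Callable[[List[str], int], str]:
    """Build a toy stand-in for a masked language model (an assumption).

    It predicts a masked word by counting which words appeared between the
    same left and right neighbours in `corpus`.
    """
    context_counts: Dict[Tuple[str, str], Counter] = {}
    for sentence in corpus:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        for i in range(1, len(words) - 1):
            key = (words[i - 1], words[i + 1])
            context_counts.setdefault(key, Counter())[words[i]] += 1

    def predict(tokens: List[str], position: int) -> str:
        # Pad so that every position has a left and a right neighbour.
        padded = ["<s>"] + tokens + ["</s>"]
        key = (padded[position], padded[position + 2])
        candidates = context_counts.get(key)
        if not candidates:
            return MASK  # no prediction available for this context
        return candidates.most_common(1)[0][0]

    return predict


def repair_sentence(
    sentence: str,
    predict: Callable[[List[str], int], str],
    vocabulary: Set[str],
) -> str:
    """Mask suspicious words one at a time and replace them with predictions.

    Treating out-of-vocabulary words as suspicious is a simplifying
    assumption of this sketch, not something stated in the source.
    """
    repaired = sentence.lower().split()
    for i, word in enumerate(list(repaired)):
        if word in vocabulary:
            continue
        masked = repaired[:i] + [MASK] + repaired[i + 1:]
        guess = predict(masked, i)
        if guess != MASK:
            repaired[i] = guess
    return " ".join(repaired)


corpus = ["the movie was great", "the movie was terrible", "the film was great"]
vocabulary = {word for sentence in corpus for word in sentence.split()}
predict = build_toy_predictor(corpus)

print(repair_sentence("the movie was gr8at", predict, vocabulary))
# prints "the movie was great"
```

No parameters are trained or updated during repair; the sketch only rewrites the input. In a realistic setting, the toy predictor would be replaced by a pretrained masked language model.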

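Large-scale language models have achieved state-of-the-art performance on many language tasks. However, they fail on adversarial language examples, which are sentences carefully optimized to fool language models while retaining a similar semantic meaning for humans. Our method dynamically adapts the input sentence using predictions for masked words, and can thereby repair many adversarial language attacks without requiring any training. Visualizations and empirical results on two popular sentence classification datasets show that our method can repair over 65% of adversarial language attacks.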