Recent studies on adversarial images have shown that they tend to leave the
underlying low-dimensional data manifold, which makes it significantly harder
for current models to classify them correctly. This so-called
off-manifold conjecture has inspired a novel line of defenses against
adversarial attacks on images. In this study, we find that a similar phenomenon
occurs in the contextualized embedding space induced by pretrained language
models, in which adversarial texts tend to have their embeddings diverge from
the manifold of natural ones. Based on this finding, we propose Textual
Manifold-based Defense (TMD), a defense mechanism that projects text embeddings
onto an approximated embedding manifold before classification. This projection reduces the
complexity of potential adversarial examples, which ultimately enhances the
robustness of the protected model. In extensive experiments, our method
consistently and significantly outperforms previous defenses under various
attack settings without trading off clean accuracy. To the best of our
knowledge, this is the first NLP defense that leverages the manifold structure
against adversarial attacks. Our code is available at
https://github.com/dangne/tmd.
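
The project-then-classify idea described above can be illustrated with a toy
sketch. This is not the implementation from the repository: the abstract does
not say how the embedding manifold is approximated, so the sketch assumes it is
represented by a finite set of reference embeddings of natural text and uses a
nearest-neighbour lookup as the projection. All names (project_onto_manifold,
defended_classify, MANIFOLD, WEIGHTS, BIAS) and the 2-D data are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def project_onto_manifold(embedding: Vector, manifold_samples: Sequence[Vector]) -> List[float]:
    """Assumed stand-in for the projection: return the nearest reference embedding."""
    if not manifold_samples:
        raise ValueError("manifold_samples must not be empty")
    nearest = min(manifold_samples, key=lambda s: euclidean(embedding, s))
    return list(nearest)


def classify(embedding: Vector, weights: Vector, bias: float) -> int:
    """Toy linear classifier: class 1 if the score is positive, else class 0."""
    score = sum(w * x for w, x in zip(weights, embedding)) + bias
    return 1 if score > 0 else 0


def defended_classify(embedding: Vector, manifold_samples: Sequence[Vector],
                      weights: Vector, bias: float) -> int:
    """Project the embedding first, then classify the projected point."""
    projected = project_onto_manifold(embedding, manifold_samples)
    return classify(projected, weights, bias)


# Hypothetical toy data: natural embeddings lie on the x-axis, and the
# classifier is spuriously sensitive to the off-manifold y direction.
MANIFOLD = [[1.0, 0.0], [2.0, 0.0], [-1.0, 0.0], [-2.0, 0.0]]
WEIGHTS = [1.0, 2.0]
BIAS = 0.0

clean = [1.0, 0.0]       # on the manifold, class 1
perturbed = [1.0, -0.8]  # pushed off the manifold along y

undefended_label = classify(perturbed, WEIGHTS, BIAS)
defended_label = defended_classify(perturbed, MANIFOLD, WEIGHTS, BIAS)
```

In this toy setting the perturbation flips the undefended prediction from
class 1 to class 0, while the projected embedding snaps back to [1.0, 0.0] and
is classified as class 1 again. A learned manifold approximation would replace
the finite reference set in practice.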

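This paper studies how the embeddings of adversarial texts diverge in the
contextualized embedding space induced by pretrained language models, and
proposes a textual defense mechanism based on the embedding manifold that
projects text embeddings onto an approximated embedding manifold before
classification, thereby improving model robustness. Experiments show that,
without sacrificing accuracy, the method consistently and significantly
outperforms previous defenses under various attack settings.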