The existence of adversarial examples has been a mystery for years and has
attracted much interest. A well-known theory by \citet{ilyas2019adversarial}
explains adversarial vulnerability from a data perspective by showing that one
can extract non-robust features from adversarial examples and these features
alone are useful for classification. However, the explanation remains quite
counter-intuitive, since non-robust features mostly look like noise to
humans. In this paper, we re-examine the theory in a broader context by
incorporating multiple learning paradigms. Notably, we find that, in contrast
to their usefulness under supervised learning, non-robust features are of
little use when transferred to other, self-supervised learning paradigms,
such as contrastive learning, masked image modeling, and diffusion models. This
reveals that non-robust features are not as useful as robust or natural
features, which transfer well between these paradigms. Meanwhile,
on the robustness side, we also show that encoders trained naturally on robust
features remain largely non-robust under AutoAttack. Our cross-paradigm
examination suggests that non-robust features are not really useful but act
more like paradigm-wise shortcuts, and that robust features alone might be
insufficient to attain reliable model robustness. Code is available at
https://github.com/PKU-ML/AdvNotRealFeatures.

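Summary: The existence of adversarial examples has been a mystery for years and has attracted wide interest. This paper re-examines the non-robust-feature theory in a broader context and finds that non-robust features, which look like noise to humans, are less useful than robust or natural features, which transfer well across learning paradigms. It also shows that encoders trained naturally on robust features remain non-robust under AutoAttack, suggesting that robust features alone may not be enough to obtain reliable model robustness.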

Adversarial Examples Are Not Real Features

Existing studies have demonstrated that adversarial examples can be directly
attributed to the presence of non-robust features, which are highly predictive,
but can be easily manipulated by adversaries to fool NLP models. In this study,
we explore the feasibility of capturing task-specific robust features, while
eliminating the non-robust ones by using the information bottleneck theory.
Through extensive experiments, we show that models trained with our
information bottleneck-based method achieve a significant improvement in
robust accuracy, exceeding the performance of all previously reported
defense methods while suffering almost no drop in clean accuracy on the
SST-2, AGNEWS and IMDB datasets.
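
The abstract does not state the training objective. As a rough guide, the standard information bottleneck Lagrangian is sketched below; the paper's actual loss may differ. The symbols $X$ (input text), $Y$ (task label), $Z$ (learned representation) and $\beta$ (trade-off weight) are assumptions introduced here for illustration.

```latex
% Standard information bottleneck Lagrangian (assumed form, not taken from the paper).
% X: input text, Y: task label, Z: learned representation, \beta > 0: trade-off weight.
\begin{equation}
  \min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y)
\end{equation}
```

Read this way, the $I(X; Z)$ term compresses away input detail, which is where easily manipulated non-robust features would be expected to live, while the $I(Z; Y)$ term keeps the information needed for the task.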

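Summary: This paper uses information bottleneck theory to study how to eliminate easily attacked non-robust features while capturing task-specific robust features. Extensive experiments on the SST-2, AGNEWS and IMDB datasets show that the method significantly improves robust accuracy, exceeding all previously reported defense methods, with almost no drop in clean accuracy.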

Improving the Adversarial Robustness of NLP Models by Information Bottleneck

Adversarial examples have attracted significant attention in machine
learning, but the reasons for their existence and pervasiveness remain unclear.
We demonstrate that adversarial examples can be directly attributed to the
presence of non-robust features: features derived from patterns in the data
distribution that are highly predictive, yet brittle and incomprehensible to
humans. After capturing these features within a theoretical framework, we
establish their widespread existence in standard datasets. Finally, we present
a simple setting where we can rigorously tie the phenomena we observe in
practice to a misalignment between the (human-specified) notion of robustness
and the inherent geometry of the data.
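
The abstract does not spell out the theoretical framework. One common way to formalize "highly predictive, yet brittle" is sketched below for binary classification. The notation (labels $y \in \{-1, +1\}$, a feature $f$, thresholds $\rho$ and $\gamma$, and a perturbation set $\Delta(x)$) is an assumption introduced here and may not match the paper's exact definitions.

```latex
% Assumed setting: binary labels y \in \{-1, +1\}, a feature f : \mathcal{X} \to \mathbb{R},
% data distribution \mathcal{D}, and a set \Delta(x) of allowed perturbations of x.
\begin{align}
  \text{$\rho$-useful:} \quad
    & \mathbb{E}_{(x,y) \sim \mathcal{D}} \big[ y \cdot f(x) \big] \ge \rho, \\
  \text{$\gamma$-robustly useful:} \quad
    & \mathbb{E}_{(x,y) \sim \mathcal{D}} \Big[ \inf_{\delta \in \Delta(x)} y \cdot f(x + \delta) \Big] \ge \gamma .
\end{align}
```

Under this reading, a non-robust feature is one that is $\rho$-useful for some $\rho > 0$ but not $\gamma$-robustly useful for any $\gamma \ge 0$: it correlates with the label on clean data, and an adversary can break that correlation within the allowed perturbation set.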

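Summary: Using a theoretical framework and experiments, this study shows that the pervasiveness of adversarial examples stems from non-robust features in the data distribution that are brittle and incomprehensible to humans, and ties the observed phenomena to a misalignment between the human-specified notion of robustness and the inherent geometry of the data.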