Adversarial attacks dramatically change the output of an otherwise accurate learning system using a seemingly inconsequential modification to a piece of input data. Paradoxically, empirical evidence indicates that even systems which are robust to large random perturbations of the input data remain susceptible to small, easily constructed adversarial perturbations of their inputs. Here, we show that this may be seen as a fundamental feature of classifiers working with high-dimensional input data. We introduce a simple, generic and generalisable framework in which key behaviours observed in practical systems arise with high probability -- notably the simultaneous susceptibility of the (otherwise accurate) model to easily constructed adversarial attacks, and robustness to random perturbations of the input data. We confirm that the same phenomena are directly observed in practical neural networks trained on standard image classification problems, where even large additive random noise fails to trigger the adversarial instability of the network. A surprising takeaway is that even small margins separating a classifier's decision surface from training and testing data can hide adversarial susceptibility from detection by randomly sampled perturbations. Counterintuitively, using additive noise during training or testing is therefore inefficient for eradicating or detecting adversarial examples, and more demanding adversarial training is required.

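Observing practical systems that work with high-dimensional input data, we show that simultaneous susceptibility to easily constructed adversarial attacks and robustness to random perturbations is a fundamental feature of such models. Surprisingly, even when only a small margin separates a classifier's decision boundary from the training and testing data, adversarial examples are hard to detect using randomly sampled perturbations, so more demanding adversarial training is required.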
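The margin effect described above can be illustrated with a small numerical sketch. This is not the framework introduced in the paper: it is a toy linear classifier in a high-dimensional space, and every name and parameter (`run_demo`, `dim`, `margin`, `noise_norm`, `trials`) is a hypothetical choice made for illustration. A point sits at a small distance `margin` from the decision surface; a step of norm only slightly larger than `margin` along the surface normal changes its label, while random perturbations roughly ten times larger are expected to leave the label unchanged, because a random direction in high dimension is nearly orthogonal to the normal.

```python
import math
import random


def unit_vector(dim, rng):
    """Return a direction drawn uniformly from the unit sphere in R^dim."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def classify(w, x):
    """Toy linear classifier (an assumption): sign of <w, x>, surface <w, x> = 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1


def run_demo(dim=5000, margin=0.1, noise_norm=1.0, trials=100, seed=0):
    rng = random.Random(seed)
    w = unit_vector(dim, rng)          # unit normal of the decision surface
    x = [margin * wi for wi in w]      # point at distance `margin` from the surface
    label = classify(w, x)

    # Adversarial step: move along the normal, just past the decision surface.
    adv_norm = 1.1 * margin
    x_adv = [xi - adv_norm * wi for xi, wi in zip(x, w)]
    adversarial_flips = classify(w, x_adv) != label

    # Random perturbations of much larger norm, in uniformly random directions.
    random_flips = 0
    for _ in range(trials):
        u = unit_vector(dim, rng)
        x_rand = [xi + noise_norm * ui for xi, ui in zip(x, u)]
        if classify(w, x_rand) != label:
            random_flips += 1

    return {
        "adversarial_norm": adv_norm,
        "adversarial_flips": adversarial_flips,
        "random_norm": noise_norm,
        "random_flips": random_flips,
        "trials": trials,
    }


if __name__ == "__main__":
    print(run_demo())
```

In this sketch the component of a random unit direction along the normal has standard deviation about 1/sqrt(dim), so with `dim=5000` a perturbation of norm 1.0 moves the point only about 0.014 towards or away from the surface on average, well short of the 0.1 margin. Lowering `dim` makes random flips progressively more likely, which is one way to see why the effect is tied to high dimension.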

How adversarial attacks disrupt seemingly stable and accurate classifiers