Adversarial examples are inputs that are specifically crafted to deceive machine learning classifiers. State-of-the-art adversarial example detection methods characterize an input example as adversarial either by quantifying the magnitude of feature variations under multiple perturbations or by measuring its distance from the estimated distribution of benign examples. Instead of using such metrics, the proposed method is based on the observation that the directions of adversarial gradients when crafting (new) adversarial examples play a key role in characterizing the adversarial space. Compared to detection methods that use multiple perturbations, the proposed method is efficient, as it applies only a single random perturbation to the input example. Experiments conducted on two datasets, CIFAR-10 and ImageNet, show that the proposed detection method achieves, respectively, 97.9% and 98.6% AUC-ROC (on average) on five different adversarial attacks, and outperforms multiple state-of-the-art detection methods. The results demonstrate the effectiveness of using adversarial gradient directions for adversarial example detection.

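This paper proposes an adversarial example detection method based on adversarial gradient directions, for identifying inputs that are specifically crafted to deceive machine learning classifiers. The method applies only a single random perturbation to the input example, which makes it more efficient than detection methods that use multiple perturbations. Experiments on CIFAR-10 and ImageNet show that the method achieves AUC-ROC values of 97.9% and 98.6%, respectively, and outperforms multiple other state-of-the-art detection methods.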

Detecting adversarial examples using adversarial gradient directions to defeat attackers.
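
To make the core idea concrete, the following is a minimal sketch of comparing adversarial gradient directions before and after a single random perturbation. It is not the authors' detector: the abstract does not specify the classifier, the perturbation, or how gradient directions are compared, so the toy linear softmax classifier, the Gaussian perturbation with scale `sigma`, the use of cosine similarity, and the names `loss_gradient` and `gradient_direction_similarity` are all assumptions made for illustration.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def loss_gradient(W, b, x, label):
    """Gradient of the cross-entropy loss w.r.t. the input x.

    Assumes a toy linear softmax classifier with logits W @ x + b
    (a stand-in for a real network, chosen so no autodiff is needed).
    """
    p = softmax(W @ x + b)
    p[label] -= 1.0          # d(loss)/d(logits) = softmax - one_hot(label)
    return W.T @ p           # chain rule back to the input


def gradient_direction_similarity(W, b, x, sigma=0.05, rng=None):
    """Cosine similarity between the adversarial gradient direction at x
    and at one randomly perturbed copy of x (hypothetical feature)."""
    rng = np.random.default_rng() if rng is None else rng
    label = int(np.argmax(W @ x + b))                    # predicted class of x
    x_pert = x + sigma * rng.standard_normal(x.shape)    # single random perturbation
    g = loss_gradient(W, b, x, label)
    g_pert = loss_gradient(W, b, x_pert, label)
    denom = np.linalg.norm(g) * np.linalg.norm(g_pert) + 1e-12
    return float(g @ g_pert / denom)
```

In a full detector, a score such as this would be one input feature to a separate binary classifier trained on benign and adversarial examples; how that classifier is built is not described in the abstract and is left out here.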