Deep neural network (DNN) predictions have been shown to be vulnerable to carefully crafted adversarial perturbations. Specifically, so-called universal adversarial perturbations are image-agnostic perturbations that can be added to any image and can fool a target network into making erroneous predictions. Departing from existing adversarial defense strategies, which work in the image domain, we present a novel defense which operates in the DNN feature domain and effectively defends against such universal adversarial attacks. Our approach identifies pre-trained convolutional features that are most vulnerable to adversarial noise and deploys defender units which transform (regenerate) these DNN filter activations into noise-resilient features, guarding against unseen adversarial perturbations. The proposed defender units are trained using a target loss on synthetic adversarial perturbations, which we generate with a novel efficient synthesis method. We validate the proposed method for different DNN architectures, and demonstrate that it outperforms existing defense strategies across network architectures by more than 10% in restored accuracy. Moreover, we demonstrate that the approach also improves resilience of DNNs to other unseen adversarial attacks.

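This paper proposes a new defense mechanism for deep neural networks. It applies trainable feature regeneration to the pre-trained convolutional features in the DNN feature domain that are most vulnerable to adversarial noise, transforming these DNN filter activations into more robust features and thereby effectively protecting against universal perturbations. By regenerating only the top 50% most vulnerable activations in at most 6 DNN layers and leaving all remaining activations unchanged, with no other modifications, our defense, trained on ImageNet against a single universal adversarial attack, also holds up against other types of universal attacks.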
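The selective step described above (regenerate the most vulnerable half of a layer's activations, pass the rest through untouched) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the per-channel `scores`, the `regenerate` callable, and both function names are hypothetical stand-ins, since the text here does not specify how vulnerability is ranked or how a defender unit is built.

```python
import numpy as np

def select_vulnerable(scores, fraction=0.5):
    """Return indices of the top `fraction` of channels by vulnerability score.

    `scores` is an assumed per-channel ranking; the ranking criterion
    itself is not given in the text above.
    """
    k = int(round(len(scores) * fraction))
    return np.argsort(scores)[::-1][:k]

def selective_regeneration(activations, scores, regenerate, fraction=0.5):
    """Regenerate only the most vulnerable channels of one layer.

    activations: array of shape (C, H, W) for a single layer.
    regenerate:  hypothetical stand-in for a trained defender unit; it maps
                 the selected channels to same-shaped replacement features.
    All other channels are copied through unchanged.
    """
    idx = select_vulnerable(scores, fraction)
    out = activations.copy()
    out[idx] = regenerate(activations[idx])
    return out

# Toy usage: 4 channels, of which the 2 highest-scoring are regenerated.
rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 2, 2))
scores = np.array([0.1, 0.9, 0.5, 0.7])

# Zeroing is only a placeholder transform, not a real defender unit.
out = selective_regeneration(acts, scores, regenerate=np.zeros_like)
```

In this toy run channels 1 and 3 are replaced and channels 0 and 2 are left as they were. A real defender unit would be a small trained network rather than the zeroing placeholder, and the same selection would be repeated for each of the (at most 6) defended layers.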

Defending Against Universal Attacks Through Selective Feature Regeneration