We present a new algorithm for learning a deep neural network model that is robust against adversarial attacks. Previous work demonstrates that an adversarially trained Bayesian Neural Network (BNN) provides improved robustness. We recognize that the adversarial learning approach for approximating the multi-modal posterior distribution of a Bayesian model can lead to mode collapse; consequently, the model's robustness and performance are sub-optimal. Instead, we first propose preventing mode collapse to better approximate the multi-modal posterior distribution. Second, based on the intuition that a robust model should ignore perturbations and consider only the informative content of the input, we conceptualize and formulate an information gain objective that measures the information learned from benign and adversarial training instances and forces the two to be similar. Importantly, we prove and demonstrate that minimizing the information gain objective allows the adversarial risk to approach the conventional empirical risk. We believe our efforts provide a step toward a principled method for adversarially training BNNs. Our model demonstrates significantly improved robustness (up to 20%) compared with adversarial training and Adv-BNN under PGD attacks with 0.035 distortion on both the CIFAR-10 and STL-10 datasets.
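To make the two risk notions in the abstract concrete, the following is a minimal sketch rather than the paper's method. It compares the empirical risk and a PGD-based estimate of the adversarial risk for a toy linear logistic classifier. The model, the helper names `loss_and_grad` and `pgd_linf`, the step size, and the iteration count are all illustrative assumptions. Only the 0.035 distortion budget comes from the text, and treating it as an L-infinity bound is also an assumption. Random restarts and clipping to a valid pixel range, which practical PGD implementations often use, are omitted.

```python
import numpy as np

def loss_and_grad(w, x, y):
    """Per-example logistic loss and its gradient with respect to the input x.

    Labels y are in {-1, +1}. The linear model is a toy stand-in, not a BNN.
    """
    margin = y * (x @ w)
    loss = np.log1p(np.exp(-margin))
    # d/dx log(1 + exp(-y * w.x)) = -y * w / (1 + exp(y * w.x))
    grad_x = (-y / (1.0 + np.exp(margin)))[:, None] * w[None, :]
    return loss, grad_x

def pgd_linf(w, x, y, eps=0.035, step=0.01, n_steps=10):
    """Projected gradient ascent on the loss inside an L-infinity ball (assumed norm)."""
    x_adv = x.copy()
    for _ in range(n_steps):
        _, g = loss_and_grad(w, x_adv, y)
        x_adv = x_adv + step * np.sign(g)          # ascent step on the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)   # project back into the eps-ball
    return x_adv

rng = np.random.default_rng(0)
w = rng.normal(size=5)                  # hypothetical model weights
x = rng.normal(size=(8, 5))             # hypothetical benign inputs
y = np.where(x @ w > 0, 1.0, -1.0)      # labels the toy model classifies correctly

clean_risk = loss_and_grad(w, x, y)[0].mean()   # empirical risk on benign inputs
x_adv = pgd_linf(w, x, y)
adv_risk = loss_and_grad(w, x_adv, y)[0].mean() # risk on PGD-perturbed inputs
print(f"empirical risk: {clean_risk:.4f}, adversarial risk: {adv_risk:.4f}")
```

In this sketch the adversarial risk is never smaller than the empirical risk, and the gap between the two is the quantity the abstract says the information gain objective helps to close.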

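This paper proposes a new algorithm for training deep neural network models to withstand adversarial attacks, together with a method that prevents mode collapse in order to better approximate the multi-modal posterior distribution of a Bayesian model. The authors prove that minimizing the proposed information gain objective allows the adversarial risk to approach the conventional empirical risk, and they show robustness improvements of up to 20% over existing algorithms on the CIFAR-10 and STL-10 datasets.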

Bayesian learning combined with information gain reliably bounds adversarial risk