This paper investigates recently proposed approaches for defending against adversarial examples and evaluating adversarial robustness. The existence of adversarial examples in trained neural networks reflects the fact that expected risk alone does not capture a model's performance against worst-case inputs. We motivate the use of adversarial risk as an objective, although it cannot easily be computed exactly. We then frame commonly used attacks and evaluation metrics as defining a tractable surrogate objective for the true adversarial risk. This suggests that models may become obscured to adversaries by optimizing this surrogate rather than the true adversarial risk. We demonstrate that this is a significant problem in practice by repurposing gradient-free optimization techniques into adversarial attacks, which we use to decrease the accuracy of several recently proposed defenses to near zero. Our hope is that our formulations and results will help researchers to develop more powerful defenses.
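
To make the distinction between the three quantities concrete, the following is a sketch in standard notation. The symbols are assumptions introduced here for illustration and do not appear in the abstract: $f_\theta$ is the model, $\ell$ a loss, $\mathcal{D}$ the data distribution, $N(x)$ the set of allowed perturbations of $x$, and $A$ a fixed attack that returns some point $A(x) \in N(x)$.

```latex
% Expected risk: average-case performance
R(\theta) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\ell(f_\theta(x), y)\big]

% Adversarial risk: worst case over the allowed neighborhood of each input
R_{\mathrm{adv}}(\theta) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\Big[\sup_{x' \in N(x)} \ell(f_\theta(x'), y)\Big]

% Surrogate defined by a specific attack A, with A(x) \in N(x)
\widehat{R}_{\mathrm{adv}}(\theta; A) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\ell(f_\theta(A(x)), y)\big]
\;\le\; R_{\mathrm{adv}}(\theta)
```

Under these assumptions the surrogate can only underestimate the adversarial risk, so a model can look robust against a weak attack $A$ while its true adversarial risk remains high.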

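This paper studies recent methods for defending against adversarial examples and for evaluating adversarial robustness. It proposes "adversarial risk" as the objective for achieving model robustness and frames commonly used attacks and evaluation metrics as a tractable surrogate for the true adversarial risk, noting that models may optimize this surrogate rather than the adversarial risk itself. It develops tools and heuristics for identifying obscured models and designing transparent models, and demonstrates that this is a significant problem in practice by repurposing gradient-free optimization techniques into adversarial attacks, which are used to reduce the accuracy of several recently proposed defenses to near zero. We hope that our formulations and results will help researchers develop more powerful defenses.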
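
As a rough illustration of how a gradient-free optimizer can be turned into an attack, here is a minimal sketch. The text above does not name a specific technique, so the use of an SPSA-style gradient estimator is an assumption made here, as are the function names, the hyperparameters, and the toy linear model. The attack only queries loss values, never gradients of the model.

```python
import numpy as np

def spsa_gradient(loss_fn, x, delta=0.01, num_samples=32, rng=None):
    """Estimate the gradient of loss_fn at x using only function evaluations."""
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(x)
    for _ in range(num_samples):
        # Random +/-1 perturbation direction (Rademacher vector).
        v = rng.choice([-1.0, 1.0], size=x.shape)
        diff = loss_fn(x + delta * v) - loss_fn(x - delta * v)
        # For +/-1 entries, dividing by v_i equals multiplying by v_i.
        grad += diff / (2.0 * delta) * v
    return grad / num_samples

def gradient_free_attack(loss_fn, x, eps=0.1, step=0.02, iters=50, seed=0):
    """Ascend the loss inside an L-infinity ball of radius eps around x."""
    rng = np.random.default_rng(seed)
    x_adv = x.copy()
    for _ in range(iters):
        g = spsa_gradient(loss_fn, x_adv, rng=rng)
        x_adv = x_adv + step * np.sign(g)         # signed ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project back into the ball
    return x_adv

# Hypothetical toy setup: a linear scorer with true label y = +1.
w = np.array([1.0, -2.0, 0.5])
y = 1.0
x = np.array([0.3, -0.2, 0.1])
loss = lambda z: -y * float(w @ z)  # lower margin means higher loss

x_adv = gradient_free_attack(loss, x)
```

Because the estimator needs no access to model gradients, an attack of this kind is unaffected by defenses that merely make gradients uninformative, which is the failure mode the surrogate-versus-true-risk framing warns about.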

Adversarial Risk and the Dangers of Evaluating Against Weak Attacks