Adversarial examples are malicious inputs designed to fool machine learning models. They often transfer from one model to another, allowing attackers to mount black box attacks without knowledge of the target model's parameters. Adversarial training is the process of explicitly training a model on adversarial examples, in order to make it more robust to attack or to reduce its test error on clean inputs. So far, adversarial training has primarily been applied to small problems. In this research, we apply adversarial training to ImageNet. Our contributions include: (1) recommendations for how to succesfully scale adversarial training to large models and datasets, (2) the observation that adversarial training confers robustness to single-step attack methods, (3) the finding that multi-step attack methods are somewhat less transferable than single-step attack methods, so single-step attacks are the best for mounting black-box attacks, and (4) resolution of a "label leaking" effect that causes adversarially trained models to perform better on adversarial examples than on clean examples, because the adversarial example construction process uses the true label and the model can learn to exploit regularities in the construction process.

将对抗训练应用于ImageNet，并提出了如何将对抗训练成功扩展到大型模型和数据集的建议，发现对抗训练能增加对单步攻击方法的鲁棒性，单步攻击方法比多步攻击方法更难以传递，使其成为发动黑盒攻击的最佳选择。研究还揭示了“标签泄漏”效应，因为对抗样本构建过程使用真实标签，模型可以学习利用构建过程的规律，使经过对抗训练的模型在对抗示例上表现比正常示例更好。

规模化对抗机器学习