In this paper, we propose CARSO, a novel adversarial defence mechanism for image classification inspired by cues from cognitive neuroscience. The method acts synergistically with adversarial training and relies on knowledge of the internal representation of the attacked classifier. Using a generative model for adversarial purification, conditioned on that representation, it samples reconstructions of the inputs, which are then classified. Experimental evaluation on a well-established benchmark of varied, strong adaptive attacks, across diverse image datasets and classifier architectures, shows that CARSO defends the classifier significantly better than state-of-the-art adversarial training alone, at a tolerable cost in clean accuracy. Furthermore, the defensive architecture effectively shields itself from unforeseen threats and from end-to-end attacks adapted to fool stochastic defences. Code and pre-trained models are available at https://github.com/emaballarin/CARSO .
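The inference flow described in the abstract can be illustrated with a minimal, hypothetical sketch. It is not the authors' implementation (see the linked repository for that). All names below (`purify_and_classify`, `toy_classifier`, `toy_purifier`, `n_samples`) are assumptions introduced for illustration. The majority vote over sampled reconstructions and the reuse of the same classifier for the final decision are also assumptions, since the abstract does not specify how the reconstructions are classified or aggregated.

```python
import random
from collections import Counter
from typing import Callable, List, Tuple

Vector = List[float]
# Hypothetical interfaces, assumed for this sketch only:
#   classifier(x) -> (predicted label, internal representation)
#   purifier(x, representation, rng) -> one sampled reconstruction of x
Classifier = Callable[[Vector], Tuple[int, Vector]]
Purifier = Callable[[Vector, Vector, random.Random], Vector]


def purify_and_classify(
    x: Vector,
    classifier: Classifier,
    purifier: Purifier,
    n_samples: int = 8,
    seed: int = 0,
) -> int:
    """Classify x via sampled reconstructions conditioned on the classifier's representation."""
    rng = random.Random(seed)
    # 1. Read the internal representation the classifier computes for the input.
    _, representation = classifier(x)
    votes: Counter = Counter()
    for _ in range(n_samples):
        # 2. Sample a reconstruction from the conditional generative model.
        x_rec = purifier(x, representation, rng)
        # 3. Classify the reconstruction (assumption: same classifier is reused).
        label, _ = classifier(x_rec)
        votes[label] += 1
    # 4. Aggregate by majority vote (assumption: aggregation rule is not given in the abstract).
    return votes.most_common(1)[0][0]


def toy_classifier(x: Vector) -> Tuple[int, Vector]:
    """Toy stand-in: label is the sign of the sum; the 'representation' is that sum."""
    s = sum(x)
    return (1 if s > 0 else 0), [s]


def toy_purifier(x: Vector, representation: Vector, rng: random.Random) -> Vector:
    """Toy stand-in for a conditional generative model: blend x with a value
    derived from the representation, then add small Gaussian noise."""
    mean = representation[0] / len(x)
    return [0.5 * v + 0.5 * mean + rng.gauss(0.0, 0.01) for v in x]
```

The toy classifier and purifier only exercise the control flow. In a real setting they would be replaced by a trained (adversarially robust) classifier exposing its internal activations and a trained conditional generative model.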

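This work proposes CARSO, a novel adversarial defence mechanism inspired by cues from cognitive neuroscience. It acts as a synergistic complement to adversarial training, relies on knowledge of the internal representation of the attacked classifier, and uses a generative model for adversarial purification. Experimental results show that it protects the attacked classifier better than existing adversarial training alone, and that it effectively defends against unforeseen threats and against end-to-end attacks adapted to stochastic defences.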

CARSO: Counter-Adversarial Recall of Synthetic Observations