Deep learning has achieved impressive results in many areas of Computer Vision and Natural Language Pro- cessing. Among others, Visual Question Answering (VQA), also referred to a visual Turing test, is considered one of the most compelling problems, and recent deep learning models have reported significant progress in vision and language modeling. Although Artificial Intelligence (AI) is getting closer to passing the visual Turing test, at the same time the existence of adversarial examples to deep learning systems may hinder the practical application of such systems. In this work, we conduct the first extensive study on adversarial examples for VQA systems. In particular, we focus on generating targeted adversarial examples for a VQA system while the target is considered to be a question-answer pair. Our evaluation shows that the success rate of whether a targeted adversarial example can be generated is mostly dependent on the choice of the target question-answer pair, and less on the choice of images to which the question refers. We also report the language prior phenomenon of a VQA model, which can explain why targeted adversarial examples are hard to generate for some question-answer targets. We also demonstrate that a compositional VQA architecture is slightly more resilient to adversarial attacks than a non-compositional one. Our study sheds new light on how to build deep vision and language resilient models robust against adversarial examples.

本文研究了视觉和语言模型的对抗样本，评估发现在具备自然语言理解和复杂结构（如注意力、边界框定位和组合内部结构）的模型中可以生成高成功率的对抗样本，这些观察结果可以帮助建立有效的防御措施。

尽管有定位和注意力机制，仍然能够欺骗视觉和语言模型