visual question answering (VQA) has attracted a lot of attention in both Computer Vision and Natural Language Processing communities, not least because it off?ers insight into the relationships between two important sources of information. Current datasets, and the models built upon th