visual question answering (VQA) has attracted a lot of attention in both
Computer Vision and Natural Language Processing communities, not least because
it offers insight into the relationships between two important sources of
information. Current datasets, and the models built upon the