Many vision and language tasks require commonsense reasoning beyond data-driven image and natural language processing. Here we adopt visual question answering (VQA) as an example task, in which a system is expected to answer a natural-language question about an image. Current state-of-