We propose the task of free-form and open-ended visual question answering
(VQA). Given an image and a natural language question about the image, the task
is to provide an accurate natural language answer. Mirroring real-world
scenarios, such as helping the visually impaired, both the q