There has been an explosion of work in the vision & language community over the past few years, from image captioning to video transcription to answering questions about images. These tasks focus on literal descriptions of the image. To move beyond the literal, we explore how questions about an image often address abstract events that the objects evoke. In this paper, we introduce the novel task of Visual Question Generation (VQG), where the system is tasked with asking a natural and engaging question when shown an image. We provide three datasets that cover a variety of images, from object-centric to event-centric, providing different and more abstract training data than state-of-the-art captioning systems have used thus far. We train and test several generative and retrieval models to tackle the task of VQG. Evaluation results show that while such models ask reasonable questions for a variety of images, there is still a wide gap with human performance. Our proposed task offers a new challenge to the community, which we hope will spur further interest in exploring deeper connections between vision & language.

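This work explores how questions about an image evoke commonsense reasoning and abstract events, and proposes a novel task, Visual Question Generation (VQG), in which the system is tasked with asking a natural and engaging question when shown an image. The authors provide three datasets covering a variety of images, from object-centric to event-centric, with training data considerably more abstract than that available to current state-of-the-art captioning systems. Several generative and retrieval models are trained and tested on the VQG task. Evaluation results show that, although such models ask reasonable questions for a variety of images, a wide gap with human performance remains, which motivates further research on connecting images with commonsense and pragmatic knowledge.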

Generating Natural Questions About an Image