In this paper, we propose a new task and solution for vision and language: generation of grounded visual questions. Visual question answering (VQA) is an emerging topic which links textual questions with visual input. To the best of our knowledge, it lacks automatic method to generate reasonable and versatile questions. So far, almost all the textual questions are generated manually, as well as the corresponding answers. To this end, we propose a system that automatically generates visually grounded questions . First, visual input is analyzed with deep caption model. Second, the captions along with VGG-16 features are used as input for our proposed question generator to generate visually grounded questions. Finally, to enable generating of versatile questions, a question type selection module is provided which selects reasonable question types and provide them as parameters for question generation. This is done using a hybrid LSTM with both visual and answer input. Our system is trained using VQA and Visual7W dataset and shows reasonable results on automatically generating of new visual questions. We also propose a quantitative metric for automatic evaluation of the question quality.

本文提出了一种模型，能够从单个图像中生成具有不同类型的视觉问题。通过采样最可靠的问题类型来自动生成问，该方法在两个真实世界数据集上的实验结果表明，我们的模型在正确性和多样性方面都优于最强基线。

基于场景的视觉问题的自动生成