Current visual question answering datasets do not consider the rich semantic
information conveyed by text within an image. In this work, we present a new
dataset, ST-VQA, that aims to highlight the importance of exploiting high-level
semantic information present in images as textual cues in the VQA process. We
use this dataset to define a series of tasks of increasing difficulty for which
reading the scene text in the context provided by the visual information is
necessary to reason and generate an appropriate answer. We propose a new
evaluation metric for these tasks that accounts for both reasoning errors and
shortcomings of the text recognition module. In addition, we put forward a
series of baseline methods, which provide further insight into the newly
released dataset and set the scene for further research.
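
To illustrate how a metric can separate reasoning errors from text recognition
errors, the sketch below scores an answer by normalized edit distance, so a
near-miss such as a single misread character keeps partial credit while an
unrelated answer scores zero. This is an assumption-laden illustration and not
necessarily the metric defined for these tasks: the function names, the
lowercasing step, and the 0.5 threshold are all hypothetical choices.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def soft_answer_score(prediction: str,
                      references: Sequence[str],
                      threshold: float = 0.5) -> float:
    """Best normalized similarity against any reference answer.

    Scores below `threshold` (an assumed value) are set to 0, treating the
    answer as a reasoning error rather than a recognition slip.
    """
    pred = prediction.strip().lower()
    best = 0.0
    for reference in references:
        truth = reference.strip().lower()
        longest = max(len(pred), len(truth))
        similarity = 1.0 if longest == 0 else 1.0 - levenshtein(pred, truth) / longest
        best = max(best, similarity)
    return best if best >= threshold else 0.0


def mean_soft_score(predictions: Sequence[str],
                    references_per_question: Sequence[Sequence[str]],
                    threshold: float = 0.5) -> float:
    """Average the per-question soft scores over a set of questions."""
    scores: List[float] = [
        soft_answer_score(p, refs, threshold)
        for p, refs in zip(predictions, references_per_question)
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

Under these assumptions, a prediction of "st0p" against the reference "stop"
differs by one substitution in four characters and keeps most of its credit,
whereas "exit" falls below the threshold and receives none.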

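This paper introduces a new dataset, ST-VQA, which aims to highlight the
importance of exploiting textual information in images. We use this dataset to
define a series of tasks of increasing difficulty, in which scene text must be
read in the context provided by the image in order to reason and generate an
appropriate answer. We propose a new evaluation metric that accounts for
reasoning errors as well as shortcomings of the text recognition module, and we
put forward a series of baseline methods.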