Visual question answering on document images that contain textual, visual,
and layout information, called document VQA, has received much attention
recently. Although many datasets have been proposed for developing document VQA
systems, most of the existing datasets focus on understanding the content
relationships within a single image and not across multiple images. In this
study, we propose a new multi-image document VQA dataset, SlideVQA, containing
2.6k+ slide decks composed of 52k+ slide images and 14.5k questions about a
slide deck. SlideVQA requires complex reasoning, including single-hop,
multi-hop, and numerical reasoning, and also provides annotated arithmetic
expressions of numerical answers for enhancing the ability of numerical
reasoning. Moreover, we developed a new end-to-end document VQA model that
treats evidence selection and question answering in a unified
sequence-to-sequence format. Experiments on SlideVQA show that our model
outperformed existing state-of-the-art QA models, but that it still has a large
gap behind human performance. We believe that our dataset will facilitate
research on document VQA.

提出了一个包含多种信息的文档图像的逻辑问答系统，包括视觉、文本和排版信息。SlideVQA 是一个用于复杂推理的新的多图像文档数据集，利用序列到序列模型同时处理证据选择和问题回答。实验结果表明，该方法在 SlideVQA 数据集上表现出了较好的效果。