Document Question Answering (QA) presents a challenge in understanding
visually rich documents (VRD), particularly those dominated by lengthy textual
content like research journal articles. Existing studies primarily focus on
real-world documents with sparse text, and comprehending the hierarchical
semantic relations across multiple pages to locate multimodal components
remains a challenge. To address this gap, we propose PDF-MVQA, which is
tailored to research journal articles and targets multi-page, multimodal
information retrieval. Unlike traditional machine reading
comprehension (MRC) tasks, our approach aims to retrieve entire paragraphs
containing answers or visually rich document entities like tables and figures.
Our contributions include the introduction of a comprehensive PDF Document VQA
dataset, allowing the examination of semantically hierarchical layout
structures in text-dominant documents. We also present new VRD-QA frameworks
designed to grasp textual contents and relations among document layouts
simultaneously, extending page-level understanding to the entire multi-page
document. Through this work, we aim to enhance the capabilities of existing
vision-and-language models in handling challenges posed by text-dominant
documents in VRD-QA.

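For visually rich documents dominated by text, such as lengthy research journal
articles, we propose PDF-MVQA. Existing studies focus mainly on real-world
documents with sparse text, while understanding the hierarchical semantic
relations across multiple pages to locate multimodal components remains a
challenge. Our contributions include a comprehensive PDF document visual
question answering dataset for studying semantically hierarchical layout
structures in text-dominant documents. We also propose new visually rich
document QA frameworks that consider textual content and the relations among
document layouts simultaneously, extending page-level understanding to the
entire multi-page document. Through this work, we aim to improve the ability of
existing vision-and-language models to handle visual question answering on
visually rich documents.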