The Document-based Visual Question Answering competition addresses the
automatic detection of parent-child relationships between elements in
multi-page documents. The goal is to identify the document elements that answer
a specific question posed in natural language. This paper describes PoliTo's
approach to this task. Our best solution is a text-only approach that relies
on an ad hoc sampling strategy.
Specifically, our approach leverages the Masked Language Modeling technique to
fine-tune a BERT model, focusing on sentences containing sensitive keywords
that also occur in the questions, such as references to tables or images.
With this approach, we achieve high performance relative to the baselines,
showing that keyword-driven sentence selection contributes positively to this
task.
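
The keyword-driven sentence selection described above can be illustrated with a minimal sketch. It is not the authors' implementation: the keyword set, the naive sentence splitter, and the function names (`select_sentences`, `split_sentences`, `tokens`) are assumptions made for illustration, and exact word matching is used, so plural forms such as "tables" would not match "table".

```python
import re

# Hypothetical keyword set; the abstract only mentions references to
# tables or images as examples of sensitive keywords.
SENSITIVE_KEYWORDS = {"table", "figure", "image"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokens(text):
    """Lowercased alphabetic tokens of a string, as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))


def select_sentences(question, document):
    """Keep sentences sharing at least one sensitive keyword with the question."""
    shared = SENSITIVE_KEYWORDS & tokens(question)
    return [s for s in split_sentences(document) if shared & tokens(s)]


doc = "Table 2 reports the accuracy. The method is simple. See the table for details."
selected = select_sentences("Which section describes the table?", doc)
```

Under these assumptions, the selected sentences would then form the sample used for Masked Language Modeling fine-tuning; the fine-tuning step itself is not shown here.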

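This paper describes PoliTo's approach to the Document-based Visual Question Answering competition. In particular, we use a text-only approach with a specific sampling strategy, fine-tuning a BERT model with a focus on sentences that contain sensitive keywords, such as references to tables or images, in order to answer natural language questions and achieve high performance.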

Enhancing BERT-Based Visual Question Answering through Keyword-Driven Sentence Selection

Document-based Visual Question Answering is a challenging task that lies between
linguistic sense disambiguation and fine-grained multimodal retrieval. Although
there has been encouraging progress in document-based question answering due to
the use of large language models and open-world prior models\cite{1}, several
challenges persist, including long response times, extended inference
durations, and imprecise matching. To overcome these challenges,
we propose Jaeger, a concatenation-based multi-transformer VQA model. To derive
question features, we use RoBERTa large\cite{2} and GPT2-xl\cite{3} as feature
extractors and then concatenate the outputs of the two models. This operation
allows the model to consider information from diverse sources concurrently,
strengthening its representational capability. By leveraging pre-trained models
for feature extraction, our approach has the potential to amplify the
performance of these models through concatenation. After concatenation, we
apply dimensionality reduction to the output features, reducing the model's
computational complexity and inference time. Empirical results demonstrate
that our proposed model achieves competitive performance on Task C of the
PDF-VQA Dataset.
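
The concatenate-then-reduce step can be sketched in a few lines. This is only an illustration under stated assumptions: the two input vectors stand in for the RoBERTa large and GPT2-xl question features, the toy dimensions are arbitrary, and a fixed random linear projection is assumed for the dimensionality reduction because the abstract does not say which method is used. The names `concat_features`, `make_projection`, and `project` are hypothetical.

```python
import random


def concat_features(feat_a, feat_b):
    """Concatenate two feature vectors into one longer vector."""
    return list(feat_a) + list(feat_b)


def make_projection(in_dim, out_dim, seed=0):
    """Build an out_dim x in_dim weight matrix (assumed linear reduction)."""
    rng = random.Random(seed)
    scale = in_dim ** 0.5
    return [[rng.uniform(-1.0, 1.0) / scale for _ in range(in_dim)]
            for _ in range(out_dim)]


def project(features, weights):
    """Apply the linear projection: one dot product per output dimension."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


# Toy stand-ins for the outputs of the two feature extractors.
feat_first = [0.1, 0.2, 0.3, 0.4]
feat_second = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

fused = concat_features(feat_first, feat_second)   # length 4 + 6 = 10
weights = make_projection(len(fused), out_dim=3)
reduced = project(fused, weights)                  # length 3
```

In a trained model the projection weights would be learned rather than random; the sketch only shows how the feature size grows with concatenation and shrinks again after reduction.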

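This paper introduces Jaeger, a concatenation-based multi-transformer VQA model that addresses challenges in document-based visual question answering. The model uses RoBERTa large and GPT2-xl as feature extractors and concatenates the outputs of the two models to strengthen its representational capability, with the aim of reducing computational complexity and inference time. Empirical results show that the model achieves competitive performance on Task C of the PDF-VQA dataset.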

Jaeger: A Concatenation-Based Multi-Transformer VQA Model

Document-based Visual Question Answering examines the understanding of
document images conditioned on natural language questions. We propose a
new document-based VQA dataset, PDF-VQA, to comprehensively examine
document understanding from various aspects, including document element
recognition, document layout structural understanding, contextual
understanding, and key information extraction. Our PDF-VQA dataset extends the
current scale of document understanding, which is limited to a single document
page, to a new scale that asks questions over full documents of multiple pages.
We also propose a new graph-based VQA model that explicitly integrates the
spatial and hierarchically structural relationships between different document
elements to boost document structural understanding. Its performance is
compared with several baselines over different question types and
tasks\footnote{The full dataset will be released after paper acceptance.}.
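
As a rough illustration of what integrating spatial and hierarchical relationships between document elements could look like, the sketch below builds a small edge list over elements. It is not the proposed model: the `Element` fields, the relation names `parent_of` and `above`, and the rule that spatial edges only link elements on the same page are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Element:
    elem_id: str
    kind: str                                  # e.g. "title", "paragraph", "table"
    page: int
    bbox: Tuple[float, float, float, float]    # (x0, y0, x1, y1), top-left origin
    parent: Optional[str] = None               # id of the parent element, if any


def build_edges(elements: List[Element]) -> List[Tuple[str, str, str]]:
    """Return (source, target, relation) triples for a document graph."""
    edges = []
    # Hierarchical edges: parent -> child.
    for el in elements:
        if el.parent is not None:
            edges.append((el.parent, el.elem_id, "parent_of"))
    # Spatial edges: a is above b when a ends before b starts on the same page.
    for a in elements:
        for b in elements:
            if a is not b and a.page == b.page and a.bbox[3] <= b.bbox[1]:
                edges.append((a.elem_id, b.elem_id, "above"))
    return edges


elements = [
    Element("t1", "title", 1, (0, 0, 100, 10)),
    Element("p1", "paragraph", 1, (0, 20, 100, 60), parent="t1"),
    Element("tab1", "table", 2, (0, 0, 100, 50), parent="t1"),
]
edges = build_edges(elements)
```

Note that the hierarchical edge from `t1` to `tab1` crosses a page boundary, which is the kind of multi-page relationship the dataset is meant to probe, while the assumed spatial rule stays within a page.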

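This study proposes a document-based visual question answering model and uses the newly developed PDF-VQA dataset to comprehensively examine different aspects of document understanding, including document element recognition, document structural understanding, contextual understanding, and key information extraction. The model explicitly integrates the spatial and hierarchical structural relationships between document elements in order to strengthen document structural understanding.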