We present Multiple-Question Multiple-Answer (MQMA), a novel approach to
text-VQA in encoder-decoder transformer models. The text-VQA task requires a
model to answer a question by understanding multi-modal content: text
(typically from OCR) and an associated image. To the best of our knowledge,
almost all previous approaches for text-VQA process a single question and its
associated content to predict a single answer. To answer multiple questions
about the same image, the model must be run once per question, with the content
fed in each time. In contrast, our proposed MQMA approach takes multiple
questions and content as input at the encoder and predicts multiple answers at
the decoder in an auto-regressive manner at the same time. We make several
novel architectural modifications to standard encoder-decoder transformers to
support MQMA. We also propose a novel MQMA denoising pre-training task which is
designed to teach the model to align and delineate multiple questions and
content with associated answers. The MQMA pre-trained model achieves
state-of-the-art results on multiple text-VQA datasets, each with strong
baselines. Specifically, it obtains absolute improvements over the previous
state-of-the-art approaches of +2.5% on OCR-VQA, +1.4% on TextVQA, +0.6% on
ST-VQA, and +1.1% on DocVQA.
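
To make the multiple-question, multiple-answer input and output concrete, the following is a minimal sketch of how several questions and the shared OCR content might be packed into one encoder input, with the answers packed into one decoder target. The marker tokens (`<q0>`, `<a0>`, `<ocr>`) and the function names `pack_mqma` and `unpack_answers` are hypothetical and chosen only for illustration; the abstract does not specify the actual format, and the image features and architectural modifications are not modeled here.

```python
import re
from typing import List, Tuple


def pack_mqma(
    questions: List[str], answers: List[str], ocr_tokens: List[str]
) -> Tuple[str, str]:
    """Pack several questions and one shared OCR context into a single
    encoder input, and the matching answers into a single decoder target.

    The <q{i}>, <a{i}> and <ocr> markers are hypothetical, not from the paper.
    """
    if len(questions) != len(answers):
        raise ValueError("each question needs exactly one answer")
    question_part = " ".join(f"<q{i}> {q}" for i, q in enumerate(questions))
    encoder_input = f"{question_part} <ocr> {' '.join(ocr_tokens)}"
    decoder_target = " ".join(f"<a{i}> {a}" for i, a in enumerate(answers))
    return encoder_input, decoder_target


def unpack_answers(decoded: str, num_questions: int) -> List[str]:
    """Split one decoded sequence back into per-question answers.

    Questions whose marker is missing from the output get an empty answer.
    """
    answers = [""] * num_questions
    pattern = r"<a(\d+)>\s*(.*?)(?=\s*<a\d+>|$)"
    for index_text, answer_text in re.findall(pattern, decoded):
        index = int(index_text)
        if index < num_questions:
            answers[index] = answer_text.strip()
    return answers
```

With this packing, the content is encoded once for all questions, and a single auto-regressive decoding pass yields every answer, which is the contrast with one-question-at-a-time inference drawn above.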

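MQMA (Multiple-Question Multiple-Answer) is an approach to text-VQA (Visual Question Answering) in encoder-decoder transformer models. Rather than feeding each question and its content into the model separately, it takes multiple questions about the same image as input and predicts their answers together. The authors propose several novel architectural modifications to support MQMA, along with an MQMA denoising pre-training task that teaches the model to align and delineate multiple questions and content with their associated answers. On multiple text-VQA datasets, the MQMA pre-trained model improves on the previous state-of-the-art approaches (OCR-VQA: +2.5%, TextVQA: +1.4%, ST-VQA: +0.6%, DocVQA: +1.1%).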

Multiple-Question Multiple-Answer Text-VQA

As an important task in multimodal context understanding, Text-VQA (Visual
Question Answering) aims at question answering through reading text information
in images. It differs from the original VQA task in that Text-VQA requires
extensive understanding of scene-text relationships, in addition to the
cross-modal grounding capability. In this paper, we propose Localize, Group,
and Select (LOGOS), a novel model which attempts to tackle this problem from
multiple aspects. LOGOS leverages two grounding tasks to better localize the
key information of the image, utilizes scene text clustering to group
individual OCR tokens, and learns to select the best answer from different
sources of OCR (Optical Character Recognition) texts. Experiments show that
LOGOS outperforms previous state-of-the-art methods on two Text-VQA benchmarks
without using additional OCR annotation data. Ablation studies and analysis
demonstrate the capability of LOGOS to bridge different modalities and better
understand scene text.
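
As an illustration of the grouping step only, the sketch below clusters OCR tokens whose bounding-box centers lie close together, so that words belonging to the same sign or label end up in one group. This is a simple distance-threshold, single-linkage grouping swapped in for illustration; the abstract does not state which clustering method LOGOS uses, and the names `group_ocr_tokens` and `max_dist` are hypothetical.

```python
from typing import Dict, List, Tuple

# A box is (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]


def box_center(box: Box) -> Tuple[float, float]:
    """Return the (x, y) center of a bounding box."""
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)


def group_ocr_tokens(
    tokens: List[str], boxes: List[Box], max_dist: float
) -> List[List[str]]:
    """Group OCR tokens whose box centers are within max_dist of each other,
    directly or through a chain of neighbors (single linkage via union-find).
    """
    if len(tokens) != len(boxes):
        raise ValueError("tokens and boxes must have the same length")
    n = len(tokens)
    parent = list(range(n))

    def find(i: int) -> int:
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    centers = [box_center(b) for b in boxes]
    for i in range(n):
        for j in range(i + 1, n):
            dx = centers[i][0] - centers[j][0]
            dy = centers[i][1] - centers[j][1]
            if (dx * dx + dy * dy) ** 0.5 <= max_dist:
                parent[find(i)] = find(j)

    groups: Dict[int, List[str]] = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(tokens[i])
    return list(groups.values())
```

For example, two adjacent words on one sign would fall into the same group, while a word in a distant corner of the image would form its own group.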

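This paper proposes Localize, Group, and Select (LOGOS), a model for the Text-VQA task in multimodal context understanding. It uses grounding tasks to better localize the key information in an image, scene text clustering to group individual OCR (Optical Character Recognition) tokens, and a selection step that picks the best answer from different sources of OCR text. Experiments show that the model outperforms previous methods on two Text-VQA benchmarks.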

Localize, Group, and Select: Boosting Text-VQA by Scene Text Modeling

In this paper, we propose Text-Aware Pre-training (TAP) for Text-VQA and
Text-Caption tasks. These two tasks aim at reading and understanding scene text
in images for question answering and image caption generation, respectively. In
contrast to the conventional vision-language pre-training that fails to capture
scene text and its relationship with the visual and text modalities, TAP
explicitly incorporates scene text (generated from OCR engines) in
pre-training. With three pre-training tasks, including masked language modeling
(MLM), image-text (contrastive) matching (ITM), and relative (spatial) position
prediction (RPP), TAP effectively helps the model learn a better aligned
representation among the three modalities: text word, visual object, and scene
text. Due to this aligned representation learning, even when pre-trained on the same
downstream task dataset, TAP already boosts the absolute accuracy on the
TextVQA dataset by +5.4%, compared with a non-TAP baseline. To further improve
the performance, we build a large-scale dataset based on the Conceptual Caption
dataset, named OCR-CC, which contains 1.4 million scene text-related image-text
pairs. Pre-trained on this OCR-CC dataset, our approach outperforms the state
of the art by large margins on multiple tasks, i.e., +8.3% accuracy on TextVQA,
+8.6% accuracy on ST-VQA, and +10.2 CIDEr score on TextCaps.
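
Of the three pre-training tasks, masked language modeling is the easiest to sketch in isolation. The example below masks tokens in a combined sequence of question words and scene-text tokens and records the original tokens as prediction targets. The `[MASK]` string, the 15% default rate, and the function name `mask_tokens` follow common BERT-style practice and are assumptions; the abstract does not give TAP's masking details.

```python
import random
from typing import List, Optional, Tuple

MASK = "[MASK]"  # assumed mask symbol, following BERT-style conventions


def mask_tokens(
    tokens: List[str], mask_prob: float = 0.15, seed: Optional[int] = None
) -> Tuple[List[str], List[Optional[str]]]:
    """Randomly replace tokens with MASK.

    Returns the masked sequence and a label list that holds the original
    token at masked positions and None elsewhere.
    """
    rng = random.Random(seed)
    masked: List[str] = []
    labels: List[Optional[str]] = []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(token)  # the model must recover this token
        else:
            masked.append(token)
            labels.append(None)  # position is not scored
    return masked, labels


# Question words followed by OCR scene-text tokens (illustrative data).
sequence = ["what", "brand", "is", "this", "coca", "cola", "zero"]
masked_sequence, targets = mask_tokens(sequence, mask_prob=0.15, seed=0)
```

Because the sequence mixes question words with scene-text tokens, recovering a masked token can require information from the other modality, which is the kind of alignment the pre-training is meant to encourage.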

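This paper proposes TAP, a method that pre-trains the model with scene text generated by optical character recognition (OCR) engines, helping it learn a better-aligned representation across three modalities (text words, visual objects, and scene text) and achieving strong performance on multiple tasks.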