In this paper, we propose a novel multi-modal framework for Scene Text Visual
Question Answering (STVQA), which requires models to read scene text in images
for question answering. Unlike plain text or visual objects, which can exist
independently, scene text naturally links the text and visual modalities
by conveying linguistic semantics while simultaneously being a visual object
in an image. Unlike conventional STVQA models, which treat the linguistic
semantics and visual semantics of scene text as two separate features, we
propose a "Locate Then Generate" (LTG) paradigm that explicitly unifies these
two semantics, with the spatial bounding box as a bridge connecting them.
Specifically, LTG first locates the region in an image that may contain the
answer words with an answer location module (ALM) consisting of a region
proposal network and a language refinement network, both of which can be
transformed into each other through a one-to-one mapping via the scene text
bounding box. Next, given the answer words selected by ALM, LTG generates a
readable answer sequence with an answer generation module (AGM) based on a
pre-trained language model. Thanks to the explicit alignment of the visual
and linguistic semantics, even without any scene-text-based pre-training
tasks, LTG improves absolute accuracy by 6.06% on the TextVQA dataset and by
6.92% on the ST-VQA dataset compared with a baseline without pre-training.
We further demonstrate that LTG effectively unifies visual and text
modalities through the spatial bounding box connection, which is
underappreciated in previous methods.
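
The two-stage flow described above can be illustrated with a toy sketch. It is not the authors' implementation: the ALM is a learned region proposal network plus a language refinement network, and the AGM is a pre-trained language model, whereas the sketch below substitutes a simple box-overlap test for locating and a reading-order sort for generating. All names (`Box`, `OcrToken`, `locate`, `generate`, the `threshold` parameter) are hypothetical and chosen for illustration only. What it does show is how a bounding box can tie an image region to the OCR words it contains.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixel coordinates (hypothetical helper type)."""
    x1: float
    y1: float
    x2: float
    y2: float

    def area(self) -> float:
        # Degenerate or inverted boxes have zero area.
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)


@dataclass(frozen=True)
class OcrToken:
    """One OCR word together with the box it occupies in the image."""
    text: str
    box: Box


def overlap_ratio(token_box: Box, region: Box) -> float:
    """Fraction of the token box that lies inside the region."""
    inter = Box(
        max(token_box.x1, region.x1),
        max(token_box.y1, region.y1),
        min(token_box.x2, region.x2),
        min(token_box.y2, region.y2),
    ).area()
    area = token_box.area()
    return inter / area if area > 0 else 0.0


def locate(tokens: List[OcrToken], region: Box, threshold: float = 0.5) -> List[OcrToken]:
    """Stand-in for the locating stage: keep tokens mostly inside the region."""
    return [t for t in tokens if overlap_ratio(t.box, region) >= threshold]


def generate(selected: List[OcrToken]) -> str:
    """Stand-in for the generation stage: join tokens in reading order."""
    ordered = sorted(selected, key=lambda t: (t.box.y1, t.box.x1))
    return " ".join(t.text for t in ordered)


# Made-up OCR output for a street-sign image; the region is assumed to come
# from some upstream proposal step.
tokens = [
    OcrToken("St", Box(145, 100, 165, 120)),
    OcrToken("STOP", Box(10, 10, 50, 30)),
    OcrToken("Main", Box(100, 100, 140, 120)),
]
region = Box(95, 95, 170, 125)
answer = generate(locate(tokens, region))
print(answer)
```

In this sketch the region selects "Main" and "St" through their boxes and drops "STOP", and the second stage turns the selected words into a readable string.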

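Proposes a multi-modal framework for scene text visual question answering that follows a "Locate Then Generate" paradigm, using the spatial bounding box as a bridge between the text and visual modalities and improving absolute accuracy with the help of a pre-trained language model.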

Locate Then Generate: Bridging Vision and Language with Bounding Box for Scene-Text VQA

We propose a novel multimodal architecture for Scene Text Visual Question
Answering (STVQA), named Layout-Aware Transformer (LaTr). The task of STVQA
requires models to reason over different modalities. Thus, we first investigate
the impact of each modality, and reveal the importance of the language module,
especially when enriched with layout information. Accounting for this, we
propose a single-objective pre-training scheme that requires only text and
spatial cues. We show that applying this pre-training scheme on scanned
documents has certain advantages over using natural images, despite the domain
gap. Scanned documents are easy to procure, text-dense, and varied in layout,
helping the model learn various spatial cues (e.g., left-of, below) by tying
together language and layout information. Compared to existing
approaches, our method performs vocabulary-free decoding and, as shown,
generalizes well beyond the training vocabulary. We further demonstrate that
LaTr improves robustness towards OCR errors, a common reason for failure cases
in STVQA. In addition, by leveraging a vision transformer, we eliminate the
need for an external object detector. LaTr outperforms state-of-the-art STVQA
methods on multiple datasets, gaining 7.6% on TextVQA, 10.8% on ST-VQA, and
4.0% on OCR-VQA (all absolute accuracy).
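
To make the spatial cues mentioned above (left-of, below) concrete, here is a small sketch that derives such relations from two OCR bounding boxes with hand-written rules. It is illustrative only: LaTr is described as learning these cues through pre-training on text and layout, not through explicit rules, and the names `Box` and `spatial_relations` are hypothetical. The sketch assumes image coordinates in which y grows downward.

```python
from typing import NamedTuple, Set


class Box(NamedTuple):
    """Axis-aligned box; image coordinates, y grows downward (assumption)."""
    x1: float
    y1: float
    x2: float
    y2: float


def spatial_relations(a: Box, b: Box) -> Set[str]:
    """Return the coarse relations that hold for box a relative to box b."""
    relations = set()
    if a.x2 <= b.x1:
        relations.add("left-of")
    if a.x1 >= b.x2:
        relations.add("right-of")
    if a.y2 <= b.y1:
        relations.add("above")
    if a.y1 >= b.y2:
        relations.add("below")
    return relations


# Made-up layout of three words on a receipt-like document.
label = Box(20, 40, 100, 60)    # e.g. the word "Subtotal"
price = Box(120, 40, 160, 60)   # the amount printed on the same line
total = Box(20, 80, 100, 100)   # e.g. the word "Total" on the next line

print(spatial_relations(label, price))
print(spatial_relations(total, label))
```

Here `label` comes out as left-of `price`, and `total` as below `label`. Text-dense documents with varied layouts provide many such word pairs, which is the intuition the abstract gives for pre-training on scanned documents.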

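Proposes a new multimodal architecture, Layout-Aware Transformer (LaTr), for Scene Text Visual Question Answering (STVQA), together with a single-objective pre-training scheme that requires only text and spatial cues. By tying together language and layout information, LaTr learns a variety of spatial cues, which improves its robustness to OCR errors, and it outperforms state-of-the-art STVQA methods on multiple datasets.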