Scene-Text Visual Question Answering (ST-VQA) aims to understand scene text
in images and answer questions related to the text content. Most existing
methods rely heavily on the accuracy of Optical Character Recognition (OCR)
systems, and aggressive fine-tuning on limited spatial location information
and erroneous OCR text often leads to overfitting. In this paper, we propose
a multimodal adversarial training
architecture with spatial awareness capabilities. Specifically, we introduce an
Adversarial OCR Enhancement (AOE) module, which leverages adversarial training
in the embedding space of the OCR modality to learn fault-tolerant
representations of OCR text, thereby reducing the noise caused by OCR errors.
In addition, we add a Spatial-Aware Self-Attention (SASA) mechanism to help
the model better capture the spatial relationships among OCR tokens. Experiments
demonstrate that our method achieves significant performance improvements on
both the ST-VQA and TextVQA datasets and provides a novel paradigm for
multimodal adversarial training.

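This study proposes a multimodal adversarial training architecture that introduces an Adversarial OCR Enhancement (AOE) module and a Spatial-Aware Self-Attention (SASA) mechanism, aiming to improve the performance of scene-text visual question answering and offering a new approach to multimodal adversarial training.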
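
To make the idea of adversarial training in the OCR embedding space concrete, the following is a minimal sketch and not the AOE module itself. The abstract does not specify the perturbation scheme, so this assumes a single-step, L2-normalized gradient perturbation (in the style of the fast gradient method) applied to one OCR token embedding, with a toy logistic scorer standing in for the full model. All names (`fgm_perturb`, `ocr_embedding`, `weights`, `epsilon`) are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def loss(embedding, weights, label):
    """Logistic loss of a toy linear scorer; label is +1 or -1."""
    margin = label * dot(weights, embedding)
    return math.log1p(math.exp(-margin))


def loss_grad_wrt_embedding(embedding, weights, label):
    """Gradient of the logistic loss with respect to the embedding."""
    margin = label * dot(weights, embedding)
    coeff = -label / (1.0 + math.exp(margin))
    return [coeff * w for w in weights]


def fgm_perturb(embedding, weights, label, epsilon=0.1):
    """Shift the embedding by epsilon along the normalized loss gradient.

    Assumption: a single-step L2-bounded perturbation; the paper may use
    a different scheme.
    """
    grad = loss_grad_wrt_embedding(embedding, weights, label)
    norm = math.sqrt(dot(grad, grad))
    if norm == 0.0:
        return list(embedding)
    return [e + epsilon * g / norm for e, g in zip(embedding, grad)]


# Hypothetical OCR token embedding and toy scorer weights.
ocr_embedding = [0.5, -0.2, 0.8]
weights = [1.0, 0.5, -0.3]
label = 1

adv_embedding = fgm_perturb(ocr_embedding, weights, label, epsilon=0.1)
clean_loss = loss(ocr_embedding, weights, label)
adv_loss = loss(adv_embedding, weights, label)

# A training step would then minimize a combination of both losses,
# e.g. clean_loss + adv_loss, so the model tolerates small embedding noise.
total_loss = clean_loss + adv_loss
```

In this sketch the perturbed embedding always incurs a higher loss than the clean one, which is what pushes the model toward representations that are robust to small errors in the OCR input.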