Text in images carries essential information for understanding a scene and reasoning about it. The text-based visual question answering (text VQA) task focuses on visual questions that require reading text in images. Existing text VQA systems generate an answer by selecting from optical character recognition (OCR) text or a fixed vocabulary. However, the positional information of text is underused, and the generated answer lacks supporting evidence. To address these limitations, this paper proposes a localization-aware answer prediction network (LaAP-Net). LaAP-Net not only generates the answer to the question but also predicts a bounding box as evidence for the generated answer. Moreover, a context-enriched OCR representation (COR) for multimodal fusion is proposed to facilitate the localization task. LaAP-Net outperforms existing approaches on three benchmark datasets for the text VQA task by a noticeable margin.

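This paper proposes a localization-aware answer prediction network, LaAP-Net, to address the limitations of existing text VQA systems that select answers from optical character recognition (OCR) text or a fixed vocabulary, making better use of positional information. It also proposes a context-enriched OCR representation (COR) for multimodal fusion, which provides additional context for the localization task. LaAP-Net outperforms existing methods on three benchmark datasets.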

Localization-Aware Answer Prediction for Text Visual Question Answering