Vision-Language (VL) models have garnered considerable research interest; however, they still face challenges in effectively handling text within images. To address this limitation, researchers have developed two approaches. The first method involves utilizing external Optical Character Recognition (OCR) tools to extract textual information from images, which is then prepended to other textual inputs. The second strategy focuses on employing extremely high-resolution images to improve text recognition capabilities. In this paper, we focus on enhancing the first strategy by introducing a novel method, named TAP-VL, which treats OCR information as a distinct modality and seamlessly integrates it into any VL model. TAP-VL employs a lightweight transformer-based OCR module to receive OCR with layout information, compressing it into a short fixed-length sequence for input into the LLM. Initially, we conduct model-agnostic pretraining of the OCR module on unlabeled documents, followed by its integration into any VL architecture through brief fine-tuning. Extensive experiments demonstrate consistent performance improvements when applying TAP-VL to top-performing VL models, across scene-text and document-based VL benchmarks.
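
The step of compressing variable-length OCR input into a short fixed-length sequence can be illustrated with a minimal sketch. The abstract does not specify the compression mechanism, so the sketch assumes one common choice: a small set of learned query vectors that cross-attend over the OCR tokens. All names here (`embed_ocr_token`, `compress`, `make_doc`), the dimensions, and the random values standing in for learned parameters are hypothetical and are not taken from TAP-VL itself.

```python
import math
import random

random.seed(0)

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def embed_ocr_token(word_vec, bbox, page_w, page_h):
    """Append a normalized bounding box (layout) to a word embedding."""
    x0, y0, x1, y1 = bbox
    layout = [x0 / page_w, y0 / page_h, x1 / page_w, y1 / page_h]
    return word_vec + layout

def compress(ocr_tokens, queries):
    """Cross-attend K query vectors over N OCR tokens, returning K vectors.

    Assumption: single-head attention with no learned projections,
    which keeps the sketch short. The output length depends only on K.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, tok)) for tok in ocr_tokens]
        weights = softmax(scores)
        outputs.append(
            [sum(w * tok[j] for w, tok in zip(weights, ocr_tokens)) for j in range(dim)]
        )
    return outputs

def make_doc(n_words, page_w=1000, page_h=1000, word_dim=4):
    """Build a toy OCR result: random word vectors with random boxes."""
    tokens = []
    for _ in range(n_words):
        word_vec = [random.uniform(-1, 1) for _ in range(word_dim)]
        x0, y0 = random.uniform(0, 900), random.uniform(0, 950)
        bbox = (x0, y0, x0 + 100, y0 + 50)
        tokens.append(embed_ocr_token(word_vec, bbox, page_w, page_h))
    return tokens

# K = 3 query vectors of dimension 8 (4 word dims + 4 layout dims).
# Random values stand in for parameters that would be learned in pretraining.
queries = [[random.uniform(-1, 1) for _ in range(8)] for _ in range(3)]

short_out = compress(make_doc(5), queries)   # 5 OCR words
long_out = compress(make_doc(40), queries)   # 40 OCR words
print(len(short_out), len(long_out))  # 3 3
```

Whether the image contains 5 or 40 recognized words, the sequence passed on to the LLM has the same length, which is what keeps the added input cost constant under this assumed design.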

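This work addresses the challenges that vision-language models face when handling text within images. It proposes a new method, TAP-VL, which treats Optical Character Recognition (OCR) information as a distinct modality and integrates it seamlessly with vision-language models. By pretraining and fine-tuning a lightweight transformer-based OCR module, TAP-VL significantly improves the performance of VL models on multiple benchmarks, demonstrating its potential impact on image understanding.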

Text Layout-Aware Pretraining for Enriched Vision-Language Models