Large Visual-Language Models (LVLMs) have recently received increasing
attention across various domains, particularly in the field of visual
document understanding (VDU). Unlike conventional vision-language tasks, VDU
is specifically concerned with text-rich scenarios containing abundant document
elements. Nevertheless, the importance of fine-grained features remains largely
unexplored within the LVLM community, leading to suboptimal performance in
text-rich scenarios. In this paper, we refer to this as the fine-grained
feature collapse issue. To fill this gap, we propose a contrastive learning
framework, termed Document
Object COntrastive learning (DoCo), specifically tailored for the downstream
tasks of VDU. DoCo leverages an auxiliary multimodal encoder to obtain the
features of document objects and align them to the visual features generated by
the vision encoder of the LVLM, which enhances visual representation in
text-rich scenarios. The intuition is that contrastive learning between the
holistic visual representations and the multimodal fine-grained features of
document objects can assist the vision encoder in acquiring more effective
visual cues, thereby enhancing the comprehension of text-rich documents in
LVLMs. We also
demonstrate that the proposed DoCo serves as a plug-and-play pre-training
method, which can be employed in the pre-training of various LVLMs without
increasing computational complexity at inference time. Extensive experimental
results on multiple VDU benchmarks show that LVLMs equipped with DoCo achieve
superior performance and narrow the gap between VDU and generic
vision-language tasks.
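
The alignment step described in this abstract can be illustrated with a generic symmetric InfoNCE loss. This is only a minimal sketch and not the authors' implementation: it assumes each document object already has one pooled visual feature and one multimodal feature of the same dimension, and the function names and the temperature value are hypothetical choices made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(visual, objects, temperature=0.07):
    """Symmetric InfoNCE loss; visual[i] and objects[i] form the positive pair.

    Assumption: both inputs are lists of equal-length feature vectors,
    one row per document object. The temperature is an illustrative value.
    """
    v = [l2_normalize(row) for row in visual]
    o = [l2_normalize(row) for row in objects]
    n = len(v)
    # Temperature-scaled cosine similarities between every visual/object pair.
    logits = [
        [sum(a * b for a, b in zip(v[i], o[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_on_diagonal(matrix):
        # Mean negative log-softmax of the matching (diagonal) entry per row.
        total = 0.0
        for i, row in enumerate(matrix):
            peak = max(row)
            log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
            total += log_sum - row[i]
        return total / len(matrix)

    transposed = [list(col) for col in zip(*logits)]
    # Average the visual-to-object and object-to-visual directions.
    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(transposed))
```

Under these assumptions, the loss is close to zero when each visual feature points in the same direction as its paired object feature, and grows when the pairs are mismatched. Because such a loss is used only during training, it is consistent with the abstract's claim that inference cost is unchanged.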

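Using the contrastive learning framework DoCo, this work addresses the lack of fine-grained features in large visual-language models in text-rich scenarios, improves the visual representation of text-rich documents, and achieves superior performance on multiple visual document understanding benchmarks.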

Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models

In the field of document understanding, significant advances have been made
in the fine-tuning of Multimodal Large Language Models (MLLMs) with
instruction-following data. Nevertheless, the potential of text-grounding
capability within text-rich scenarios remains underexplored. In this paper, we
present a text-grounding document understanding model, termed TGDoc, which
addresses this deficiency by enhancing MLLMs with the ability to discern the
spatial positioning of text within images. Empirical evidence suggests that
text-grounding improves the model's interpretation of textual content, thereby
elevating its proficiency in comprehending text-rich images. Specifically, we
compile a dataset containing 99K PowerPoint presentations sourced from the
internet. We formulate instruction tuning tasks including text detection,
recognition, and spotting to facilitate the cohesive alignment between the
visual encoder and large language model. Moreover, we curate a collection of
text-rich images and prompt the text-only GPT-4 to generate 12K high-quality
conversations, featuring textual locations within text-rich scenarios. By
integrating text location data into the instructions, TGDoc is adept at
discerning text locations during visual question answering. Extensive
experiments demonstrate that our method achieves state-of-the-art performance
across multiple text-rich benchmarks.
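
The three instruction tuning tasks named in this abstract (text detection, recognition, and spotting) could be turned into instruction-response pairs along the following lines. This is a hypothetical sketch: the prompt wording, the normalized `[x1, y1, x2, y2]` box format, and the `build_sample` helper are assumptions made for illustration, not the format used by TGDoc.

```python
def normalize_box(box, width, height):
    """Convert a pixel box (x1, y1, x2, y2) to coordinates in [0, 1]."""
    x1, y1, x2, y2 = box
    return [
        round(x1 / width, 3),
        round(y1 / height, 3),
        round(x2 / width, 3),
        round(y2 / height, 3),
    ]


def build_sample(task, text, box, width, height):
    """Build one instruction-response pair for a text-grounding task.

    Assumed task semantics (hypothetical prompt templates):
      detection   : given the text, answer with its location
      recognition : given a location, answer with the text
      spotting    : answer with both the text and its location
    """
    loc = normalize_box(box, width, height)
    if task == "detection":
        instruction = f'Where is the text "{text}" located in the image?'
        response = str(loc)
    elif task == "recognition":
        instruction = f"What text appears in the region {loc}?"
        response = text
    elif task == "spotting":
        instruction = "Find the text in the image and give its location."
        response = f"{text} {loc}"
    else:
        raise ValueError(f"unknown task: {task}")
    return {"instruction": instruction, "response": response}
```

For example, `build_sample("detection", "Agenda", (100, 50, 300, 100), 1000, 500)` pairs a question about the word "Agenda" with the normalized box `[0.1, 0.1, 0.3, 0.2]` as the answer.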

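In the field of document understanding, this paper proposes a text-grounding document understanding model, termed TGDoc, which enhances Multimodal Large Language Models (MLLMs) with the ability to identify the spatial position of text within images, improving the accuracy of textual content interpretation and thereby the comprehension of text-rich images. Experimental evidence shows that the text-grounding approach achieves state-of-the-art performance on multiple text-rich benchmarks, validating the effectiveness of the method.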