Recently, Multimodal Large Language Models (MLLMs) that enable Large Language
Models (LLMs) to interpret images through visual instruction tuning have
achieved significant success. However, existing visual instruction tuning
methods only utilize image-language instruction data to align the language and
image modalities, lacking a more fine-grained cross-modal alignment. In this
paper, we propose Position-enhanced Visual Instruction Tuning (PVIT), which
extends the functionality of MLLMs by integrating an additional region-level
vision encoder. This integration promotes a more detailed comprehension of
images for the MLLM. In addition, to efficiently achieve a fine-grained
alignment between the vision modules and the LLM, we design multiple data
generation strategies to construct an image-region-language instruction
dataset. Finally, we present both quantitative experiments and qualitative
analysis that demonstrate the superiority of the proposed model. Code and data
will be released at this https URL

通过引入区域级别的视觉编码器，本文提出了一种增强图像教学调整功能的多模态大型语言模型（MLLMs），以实现更细粒度的模态交叉对齐，并设计了多种数据生成策略构建了图像 - 区域 - 语言指令数据集，实验结果表明该模型的卓越性能。