Recent vision-language pre-training (VLP) models have achieved significant progress. Nevertheless, these models rely heavily on image-text pairs that capture only coarse, global information about an image, which limits their regional understanding ability. In this