While existing large vision-language multimodal models focus on whole-image
understanding, there is a prominent gap in achieving region-specific
comprehension. Current approaches that use textual coordinates or spatial
encodings often fail to provide a user-friendly interface for visual prompting.
To address this challenge, we introduce a novel multimodal model capable of
decoding arbitrary visual prompts. This allows users to intuitively mark images
and interact with the model using natural cues like a "red bounding box" or
"pointed arrow". Our simple design directly overlays visual markers onto the
RGB image, eliminating the need for complex region encodings, yet achieves
state-of-the-art performance on region-understanding tasks such as Visual7W,
PointQA, and the Visual Commonsense Reasoning benchmark. Furthermore, we present
ViP-Bench, a comprehensive benchmark to assess the capability of models in
understanding visual prompts across multiple dimensions, enabling future
research in this domain. Code, data, and model are publicly available.
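
To make the overlay idea concrete, the following is a minimal sketch, not the released implementation. It assumes an image stored as rows of (R, G, B) tuples and draws a red rectangle outline directly into the pixels, so the region is indicated by the image itself rather than by a separate coordinate encoding. The names `overlay_box` and `make_blank`, the `thickness` parameter, and the example question are hypothetical and chosen only for illustration. In practice an image library would be used instead of nested lists.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]  # rows of (R, G, B) pixels


def make_blank(width: int, height: int, color: Pixel = (0, 0, 0)) -> Image:
    """Create a solid-colour image (hypothetical helper for the demo)."""
    return [[color] * width for _ in range(height)]


def overlay_box(image: Image, box: Tuple[int, int, int, int],
                color: Pixel = (255, 0, 0), thickness: int = 1) -> Image:
    """Return a copy of `image` with a rectangle outline drawn onto it.

    `box` is (left, top, right, bottom) in inclusive pixel coordinates.
    """
    height, width = len(image), len(image[0])
    left, top, right, bottom = box
    if not (0 <= left <= right < width and 0 <= top <= bottom < height):
        raise ValueError("box must lie inside the image")

    out = [list(row) for row in image]  # copy so the input stays untouched
    for y in range(top, bottom + 1):
        for x in range(left, right + 1):
            # Distance to the nearest box edge; small values form the border.
            edge = min(x - left, right - x, y - top, bottom - y)
            if edge < thickness:
                out[y][x] = color
    return out


# The marked image and a plain-language question form the model input.
image = make_blank(8, 6)
marked = overlay_box(image, (1, 1, 5, 4))
question = "What is inside the red bounding box?"
```

Because the marker lives in pixel space, the same pattern extends to arrows, scribbles, or other shapes by changing only the drawing routine; the question text simply refers to the marker by its appearance.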

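This work introduces a novel multimodal model that can decode arbitrary visual prompts. By overlaying visual markers directly onto the RGB image, it achieves region-specific understanding and state-of-the-art performance on region-understanding tasks. It also presents ViP-Bench, a comprehensive benchmark that assesses how well models understand visual prompts across multiple dimensions, enabling future research.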