BriefGPT.xyz
Dec, 2023
Making Large Multimodal Models Understand Arbitrary Visual Prompts
Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai...
TL;DR
This work introduces a novel multimodal model that can decode arbitrary visual prompts by overlaying visual markers directly onto the RGB image, enabling region-specific understanding. It achieves state-of-the-art performance on region-understanding tasks and introduces ViP-Bench, a comprehensive benchmark for evaluating a model's ability to understand visual prompts across multiple dimensions, opening avenues for future research.
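The core idea of overlaying a visual marker directly on the RGB image (rather than encoding region coordinates as text) can be sketched as follows. This is a minimal illustration using Pillow; the function name and the choice of a circle marker are assumptions for illustration, since the paper supports arbitrary marks such as circles, arrows, and scribbles.

```python
from PIL import Image, ImageDraw

def overlay_visual_prompt(image, center, radius=40, color=(255, 0, 0), width=4):
    """Hypothetical helper: draw a circle marker directly onto an RGB image.

    The marked image (not textual coordinates) is what would be fed to the
    multimodal model, so the region cue lives in pixel space.
    """
    marked = image.copy()  # keep the original image untouched
    draw = ImageDraw.Draw(marked)
    x, y = center
    draw.ellipse(
        [x - radius, y - radius, x + radius, y + radius],
        outline=color,
        width=width,
    )
    return marked

# Demo on a synthetic gray image; a real pipeline would pass `marked`
# to the vision encoder of the multimodal model.
img = Image.new("RGB", (224, 224), (128, 128, 128))
marked = overlay_visual_prompt(img, center=(112, 112))
```

Because the prompt is rendered in pixel space, the same interface handles any mark a user might draw, which is what makes the approach user-friendly compared with coordinate-based schemes.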
Abstract
While existing large vision-language multimodal models focus on whole image understanding, there is a prominent gap in achieving region-specific comprehension. Current approaches that use textual coordinates or s…