Current large multimodal models (LMMs) face challenges in grounding, which requires the model to relate language components to visual entities. Contrary to the common practice that fine-tunes LMMs with additional grounding supervision, we find that the grounding ability can in fact emerge in LMMs trained without explicit grounding supervision. To reveal this emerging grounding, we introduce an "attend-and-segment" method which leverages attention maps from standard LMMs to perform pixel-level segmentation. Furthermore, to enhance the grounding ability, we propose DIFFLMM, an LMM utilizing a diffusion-based visual encoder, as opposed to the standard CLIP visual encoder, and trained with the same weak supervision. Without being constrained by the biases and limited scale of grounding-specific supervision data, our approach is more generalizable and scalable. We achieve competitive performance on both grounding-specific and general visual question answering benchmarks, compared with grounding LMMs and generalist LMMs, respectively. Notably, we achieve a 44.2 grounding mask recall on grounded conversation generation without any grounding supervision, outperforming the extensively supervised model GLaMM. Project page: https://groundLMM.github.io.
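As a rough illustration of the "attend-and-segment" idea, the sketch below turns the attention weights from one generated text token to the visual tokens into a coarse binary mask. The abstract does not say how attention maps are converted into masks, so the min-max normalization, nearest-neighbor upsampling, fixed threshold, and the function name `attention_to_mask` are all assumptions made for illustration, not the authors' procedure.

```python
import numpy as np

def attention_to_mask(attn, grid_hw, image_hw, threshold=0.5):
    """Hypothetical helper: map 1D token-to-patch attention to a pixel mask.

    attn     : 1D array, attention from one text token to each visual token
    grid_hw  : (rows, cols) of the visual token grid
    image_hw : (height, width) of the target image
    """
    gh, gw = grid_hw
    H, W = image_hw
    grid = np.asarray(attn, dtype=np.float64).reshape(gh, gw)

    # Min-max normalize so the threshold is relative to this map (assumption).
    lo, hi = grid.min(), grid.max()
    norm = (grid - lo) / (hi - lo) if hi > lo else np.zeros_like(grid)

    # Nearest-neighbor upsampling from the token grid to pixel resolution.
    rows = np.arange(H) * gh // H
    cols = np.arange(W) * gw // W
    upsampled = norm[rows[:, None], cols[None, :]]

    # Threshold into a binary mask and record the most-attended pixel.
    mask = upsampled >= threshold
    r, c = np.unravel_index(np.argmax(upsampled), upsampled.shape)
    return mask, (int(r), int(c))
```

The most-attended pixel is returned alongside the mask because a peak location could also serve as a point prompt for a separate segmentation model; that use is likewise an assumption rather than something the abstract states.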

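Current large multimodal models (LMMs) struggle with grounding, that is, relating language components to visual entities. This paper introduces an "attend-and-segment" method and shows that grounding ability can emerge in LMMs trained without explicit grounding supervision; a diffusion-based visual encoder further improves that ability. Without using any grounding supervision, the approach remains competitive on grounded conversation generation and surpasses an extensively supervised model in grounding mask recall.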

Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision