Endowing Large Multimodal Models (LMMs) with visual grounding capability can
significantly enhance AI systems' understanding of the visual world and their
interaction with humans. However, existing methods typically fine-tune the
parameters of LMMs to learn additional segmentation tokens and overfit to
grounding and segmentation datasets. Such a design inevitably causes a
catastrophic decline in the conversational capability that is indispensable to
general AI assistants. In this paper, we comprehensively evaluate
state-of-the-art grounding LMMs across a suite of multimodal question-answering
benchmarks, observing pronounced performance drops that indicate vanishing
general knowledge comprehension and weakened instruction-following ability. To
address this issue, we present F-LMM -- grounding frozen off-the-shelf LMMs in
human-AI conversations -- a straightforward yet effective design based on the
fact that word-pixel correspondences conducive to visual grounding inherently
exist in the attention weights of well-trained LMMs. Using only a few trainable
CNN layers, we can translate word-pixel attention weights to mask logits, which
a SAM-based mask refiner can further optimise. Our F-LMM neither learns special
segmentation tokens nor utilises high-quality grounded instruction-tuning data,
but achieves competitive performance on referring expression segmentation and
panoptic narrative grounding benchmarks while completely preserving LMMs'
original conversational ability. Additionally, with instruction-following
ability preserved and grounding ability obtained, our F-LMM can perform visual
chain-of-thought reasoning and better resist object hallucinations.
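
The attention-to-mask step described above can be illustrated with a minimal
sketch. Everything in it is an assumption made for illustration rather than the
paper's implementation: the function names word_pixel_maps and mask_logits, the
toy shapes, the hand-set weights (which would be learned in practice), and the
use of a single 1x1 convolution in place of the few trainable CNN layers. The
SAM-based refiner is omitted.

```python
# Illustrative sketch only: names, shapes, and weights are hypothetical,
# not taken from the F-LMM implementation.

def word_pixel_maps(attn, grid_h, grid_w):
    """Average attention over a phrase's word tokens and reshape to 2D maps.

    attn[c][t][p] is the weight from word token t to image patch p in
    channel c (one channel per layer/head pair). Returns maps[c][y][x].
    """
    maps = []
    for channel in attn:
        n_tokens = len(channel)
        flat = [sum(tok[p] for tok in channel) / n_tokens
                for p in range(grid_h * grid_w)]
        maps.append([flat[y * grid_w:(y + 1) * grid_w] for y in range(grid_h)])
    return maps


def mask_logits(maps, weights, bias):
    """A 1x1 convolution: per-pixel weighted sum over channels plus a bias."""
    grid_h, grid_w = len(maps[0]), len(maps[0][0])
    return [[sum(w * m[y][x] for w, m in zip(weights, maps)) + bias
             for x in range(grid_w)]
            for y in range(grid_h)]


# Toy input: 2 channels, 2 word tokens, a 2x2 patch grid (4 patches).
attn = [
    [[0.7, 0.1, 0.1, 0.1], [0.5, 0.3, 0.1, 0.1]],  # channel 0
    [[0.4, 0.2, 0.2, 0.2], [0.6, 0.2, 0.1, 0.1]],  # channel 1
]
maps = word_pixel_maps(attn, grid_h=2, grid_w=2)
# Hand-set parameters stand in for weights that would be trained.
logits = mask_logits(maps, weights=[4.0, 4.0], bias=-2.0)
# Positive logits mark patches assigned to the referred object.
mask = [[logit > 0 for logit in row] for row in logits]
```

In this toy case only the top-left patch, which receives most of the attention
from both word tokens, ends up with a positive logit. The underlying LMM is
never updated; only the small mapping from attention maps to logits would be
trained.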

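By freezing well-trained Large Multimodal Models (LMMs) and grounding them in human-AI conversations, we propose F-LMM, a simple yet effective design that achieves competitive performance on referring expression segmentation and panoptic narrative grounding benchmarks while fully preserving the LMMs' conversational ability.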