Multimodal Large Language Model (MLLMs) leverages Large Language Models as a
cognitive framework for diverse visual-language tasks. Recent efforts have been
made to equip MLLMs with visual perceiving and grounding capabilities. However,
there still remains a gap in providing fine-grained pixel-level perceptions and
extending interactions beyond text-specific inputs. In this work, we propose
{\bf{AnyRef}}, a general MLLM model that can generate pixel-wise object
perceptions and natural language descriptions from multi-modality references,
such as texts, boxes, images, or audio. This innovation empowers users with
greater flexibility to engage with the model beyond textual and regional
prompts, without modality-specific designs. Through our proposed refocusing
mechanism, the generated grounding output is guided to better focus on the
referenced object, implicitly incorporating additional pixel-level supervision.
This simple modification utilizes attention scores generated during the
inference of LLM, eliminating the need for extra computations while exhibiting
performance enhancements in both grounding masks and referring expressions.
With only publicly available training data, our model achieves state-of-the-art
results across multiple benchmarks, including diverse modality referring
segmentation and region-level referring expression generation.

提出了 AnyRef 模型，它能从多模态参考中生成像素级的物体感知和自然语言描述，从而提供更大的灵活性，超越了文本和区域提示，无需特定的设计。通过提出的重新聚焦机制，生成的定位输出可以更好地聚焦在参考对象上，从而隐含地融入了像素级的监督。该模型在多个基准测试中取得了最先进的结果，包括多模态参考分割和区域级参考表达生成。