We introduce Groma, a Multimodal Large Language Model (MLLM) with grounded
and fine-grained visual perception ability. Beyond holistic image
understanding, Groma is adept at region-level tasks such as region captioning
and visual grounding. Such capabilities are built upon a localized visual
tokenization mechanism, where an image input is decomposed into regions of
interest and subsequently encoded into region tokens. By integrating region
tokens into user instructions and model responses, we seamlessly enable Groma
to understand user-specified region inputs and ground its textual output to
images. In addition, to enhance Groma's grounded chat ability, we curate a
visually grounded instruction dataset by leveraging GPT-4V and visual
prompting techniques. Compared with MLLMs that rely on the language model or
an external module for localization, Groma consistently demonstrates superior
performance on standard referring and grounding benchmarks, highlighting the
advantages of embedding localization into image tokenization.
Project page: this https URL
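
To make the idea of integrating region tokens into instructions and responses concrete, here is a minimal sketch. It is not Groma's implementation: the `Region` class, the `<rN>` placeholder format, and the helpers `build_prompt` and `ground_response` are hypothetical names chosen for illustration, and the real model encodes regions into learned embeddings rather than literal strings.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class Region:
    """A region of interest as a normalized (x1, y1, x2, y2) box."""
    box: Box


def region_token(index: int) -> str:
    # Hypothetical placeholder format; the actual token names are an assumption.
    return f"<r{index}>"


def build_prompt(regions: List[Region], template: str) -> str:
    """Fill each {k} slot in the template with the placeholder for region k."""
    tokens = [region_token(i) for i in range(len(regions))]
    return template.format(*tokens)


def ground_response(response: str, regions: List[Region]) -> Dict[str, Box]:
    """Map every region placeholder found in a response back to its box."""
    grounded = {}
    for i, region in enumerate(regions):
        token = region_token(i)
        if token in response:
            grounded[token] = region.box
    return grounded


regions = [Region((0.1, 0.2, 0.5, 0.9)), Region((0.6, 0.1, 0.9, 0.4))]
prompt = build_prompt(regions, "What is the person in {0} holding?")
boxes = ground_response("The person is holding a cup <r1>.", regions)
```

In this sketch the same placeholder serves both directions: a user refers to a region by placing its token in the instruction, and a response is grounded by looking up the tokens it contains.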

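Groma is a multimodal large language model with grounded, fine-grained visual perception. It performs region-level tasks and grounds text to images by localizing regions of interest in an image and encoding them as region tokens. In addition, a visually grounded instruction dataset is curated with GPT-4V and visual prompting techniques to enhance Groma's grounded chat ability, and Groma shows superior performance on standard referring and grounding benchmarks.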

Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models

Vision language models (VLMs) have experienced rapid advancements through the
integration of large language models (LLMs) with image-text pairs, yet they
struggle with detailed regional visual understanding due to the limited spatial
awareness of the vision encoder and the use of coarse-grained training data
that lacks detailed, region-specific captions. To address this, we introduce
RegionGPT (RGPT for short), a novel framework designed for complex region-level
captioning and understanding. RGPT enhances the spatial awareness of regional
representation with simple yet effective modifications to existing visual
encoders in VLMs. We further improve performance on tasks requiring a specific
output scope by integrating task-guided instruction prompts during both
training and inference phases, while maintaining the model's versatility for
general-purpose tasks. Additionally, we develop an automated region caption
data generation pipeline, enriching the training set with detailed region-level
captions. We demonstrate that a universal RGPT model can be effectively applied
to, and significantly enhances performance across, a range of region-level
tasks, including but not limited to complex region description, reasoning,
object classification, and referring expression comprehension.
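
As an illustration of task-guided instruction prompts, the sketch below appends a short guide to a question so that the expected output scope is stated explicitly, using the same function at training and inference time. The task names, the guide wording, and the `build_instruction` helper are assumptions made for this example and are not taken from RGPT.

```python
from typing import Optional

# Hypothetical task guides; the actual prompt wording used by RGPT is not given here.
TASK_GUIDES = {
    "classification": "Answer with a single category name.",
    "short_caption": "Describe the region in a short phrase.",
    "detailed_caption": "Describe the region in detail.",
}


def build_instruction(question: str, task: Optional[str] = None) -> str:
    """Append the guide for a task; leave general-purpose questions unchanged."""
    guide = TASK_GUIDES.get(task) if task is not None else None
    if guide is None:
        return question
    return f"{question} {guide}"


classify = build_instruction("What is in <region>?", task="classification")
general = build_instruction("What is happening in this image?")
```

Leaving the question untouched when no task is given mirrors the stated goal of keeping the model usable for general-purpose tasks.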

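RegionGPT (RGPT) is a new framework for complex region-level captioning and understanding. It improves the spatial awareness of the visual encoder and integrates task-guided instruction prompts, raising performance on region-level tasks such as complex region description, reasoning, object classification, and referring expression comprehension.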