We present a unified, promptable model capable of simultaneously segmenting,
recognizing, and captioning anything. Unlike SAM, our model aims to build a versatile
region representation in the wild via visual prompting. To achieve this, we
train a generalizable model with massive segmentation masks, e.g., SA-1B masks,
and semantic priors from a pre-trained CLIP model with 5 billion parameters.
Specifically, we construct a promptable image decoder by adding a semantic
token to each mask token. The semantic token is responsible for learning the
semantic priors in a predefined concept space. Through joint optimization of
segmentation on mask tokens and concept prediction on semantic tokens, our
model exhibits strong regional recognition and localization capabilities. For
example, an additional 38M-parameter causal text decoder trained from scratch
sets a new record with a CIDEr score of 150.7 on the Visual Genome region
captioning task. We believe this model can be a versatile region-level image
tokenizer, capable of encoding general-purpose region context for a broad range
of perception tasks. Code and models are available at
this https URL
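
The joint optimization described above, segmentation on mask tokens plus concept prediction on semantic tokens, can be illustrated with a small loss sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the use of binary cross-entropy for masks, the soft-target cross-entropy for concepts, and the `concept_weight` parameter are all hypothetical choices made for the example. The target concept distribution stands in for the semantic prior that the abstract says comes from a pre-trained CLIP model.

```python
import math


def binary_cross_entropy(pred_probs, target_mask, eps=1e-7):
    """Mean per-pixel BCE between predicted mask probabilities and a 0/1 mask."""
    total = 0.0
    for p, t in zip(pred_probs, target_mask):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(pred_probs)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def concept_cross_entropy(concept_logits, target_dist, eps=1e-7):
    """Cross-entropy between the predicted concept distribution and a soft target."""
    probs = softmax(concept_logits)
    return -sum(t * math.log(max(p, eps)) for p, t in zip(probs, target_dist))


def joint_loss(pred_mask_probs, target_mask, concept_logits, target_dist,
               concept_weight=1.0):
    """Hypothetical joint objective: mask loss plus weighted concept loss."""
    seg = binary_cross_entropy(pred_mask_probs, target_mask)
    sem = concept_cross_entropy(concept_logits, target_dist)
    return seg + concept_weight * sem
```

Calling `joint_loss` with a flattened mask prediction, its ground-truth mask, the concept logits from a semantic token, and a target distribution over the concept vocabulary returns a single scalar, so both heads are trained by one backward pass in a real framework.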

Tokenize Anything via Prompting

We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of
understanding spatial referring of any shape or granularity within an image and
accurately grounding open-vocabulary descriptions. To unify referring and
grounding in the LLM paradigm, Ferret employs a novel and powerful hybrid
region representation that integrates discrete coordinates and continuous
features jointly to represent a region in the image. To extract the continuous
features of versatile regions, we propose a spatial-aware visual sampler, adept
at handling varying sparsity across different shapes. Consequently, Ferret can
accept diverse region inputs, such as points, bounding boxes, and free-form
shapes. To bolster the desired capability of Ferret, we curate GRIT, a
comprehensive refer-and-ground instruction tuning dataset including 1.1M
samples that contain rich hierarchical spatial knowledge, with 95K hard
negative samples to promote model robustness. The resulting model not only
achieves superior performance in classical referring and grounding tasks, but
also greatly outperforms existing MLLMs in multimodal chat that demands
region-level understanding and localization. Our evaluations also reveal a
significantly improved capability of describing image details and a marked
reduction in object hallucination. Code and data will be available at
this https URL
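
The hybrid region representation described above pairs discrete coordinates with a continuous feature for the same region. The sketch below shows one way such a pair could be assembled. It is only an illustration under stated assumptions: the function names, the 1000-bin coordinate quantization, and the nested-list feature map are hypothetical, and plain masked average pooling is used in place of the spatial-aware visual sampler, whose details the abstract does not give.

```python
def quantize_box(box, image_size, num_bins=1000):
    """Map a pixel box (x1, y1, x2, y2) to integer bins in [0, num_bins - 1]."""
    width, height = image_size
    x1, y1, x2, y2 = box

    def to_bin(value, extent):
        # Scale to the bin range and clamp so the image border stays in range.
        return min(num_bins - 1, max(0, int(value / extent * num_bins)))

    return (to_bin(x1, width), to_bin(y1, height),
            to_bin(x2, width), to_bin(y2, height))


def masked_average(feature_map, mask):
    """Average the feature vectors at positions where the region mask is set.

    feature_map is rows x cols x channels (nested lists); mask is rows x cols.
    """
    channels = len(feature_map[0][0])
    sums = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(feature_map, mask):
        for feat, inside in zip(feat_row, mask_row):
            if inside:
                count += 1
                for c in range(channels):
                    sums[c] += feat[c]
    if count == 0:
        raise ValueError("region mask is empty")
    return [s / count for s in sums]


def hybrid_region(box, image_size, feature_map, mask):
    """Bundle discrete coordinates and a continuous feature for one region."""
    return {
        "coords": quantize_box(box, image_size),
        "feature": masked_average(feature_map, mask),
    }
```

Because the mask can have any shape, the same pooling step covers points, boxes, and free-form regions; only the mask changes, while the coordinate part stays a short tuple of integers that a language model can read as tokens.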
