Large Vision Language Models have achieved fine-grained object perception,
but the limitation of image resolution remains a significant obstacle to
surpass the performance of task-specific experts in complex and dense
scenarios. Such limitation further restricts the model's potential to achieve
nuanced visual and language referring in domains such as GUI Agents, Counting
and \etc. To address this issue, we introduce a unified high-resolution
generalist model, Griffon v2, enabling flexible object referring with visual
and textual prompts. To efficiently scaling up image resolution, we design a
simple and lightweight down-sampling projector to overcome the input tokens
constraint in Large Language Models. This design inherently preserves the
complete contexts and fine details, and significantly improves multimodal
perception ability especially for small objects. Building upon this, we further
equip the model with visual-language co-referring capabilities through a
plug-and-play visual tokenizer. It enables user-friendly interaction with
flexible target images, free-form texts and even coordinates. Experiments
demonstrate that Griffon v2 can localize any objects of interest with visual
and textual referring, achieve state-of-the-art performance on REC, phrase
grounding, and REG tasks, and outperform expert models in object detection and
object counting. Data, codes and models will be released at
this https URL

Griffon v2, a high-resolution generalist model, overcomes image resolution limitations in large vision language models to achieve nuanced visual and language referring, and outperforms expert models in object detection and counting.

Griffon v2: 提升高分辨率缩放和视觉语言共识的多模态感知

Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling  and Visual-Language Co-Referring

We investigate the problem of object referring (OR) i.e. to localize a target
object in a visual scene coming with a language description. Humans perceive
the world more as continued video snippets than as static images, and describe
objects not only by their appearance, but also by their spatio-temporal context
and motion features. Humans also gaze at the object when they issue a referring
expression. Existing works for OR mostly focus on static images only, which
fall short in providing many such cues. This paper addresses OR in videos with
language and human gaze. To that end, we present a new video dataset for OR,
with 30, 000 objects over 5, 000 stereo video sequences annotated for their
descriptions and gaze. We further propose a novel network model for OR in
videos, by integrating appearance, motion, gaze, and spatio-temporal context
into one network. Experimental results show that our method effectively
utilizes motion cues, human gaze, and spatio-temporal context. Our method
outperforms previousOR methods. For dataset and code, please refer
this https URL

本研究提出了一种利用视频中物体的运动特征、人眼注视和时空语境等信息进行对象指称的新型神经网络模型，并使用一个包含 30,000 个对象的测试数据集验证了该模型的有效性。

视频中的对象指称：基于语言和人类凝视

Object Referring in Videos with Language and Human Gaze

Object referring has important applications, especially for human-machine
interaction. While having received great attention, the task is mainly attacked
with written language (text) as input rather than spoken language (speech),
which is more natural. This paper investigates Object Referring with Spoken
Language (ORSpoken) by presenting two datasets and one novel approach. Objects
are annotated with their locations in images, text descriptions and speech
descriptions. This makes the datasets ideal for multi-modality learning. The
approach is developed by carefully taking down ORSpoken problem into three
sub-problems and introducing task-specific vision-language interactions at the
corresponding levels. Experiments show that our method outperforms competing
methods consistently and significantly. The approach is also evaluated in the
presence of audio noise, showing the efficacy of the proposed vision-language
interaction methods in counteracting background noise.

本文探讨了用口语作为输入的物体指称（ORSpoken），通过介绍两个数据集和一种新的方法来为多模式学习提供了理想的数据集，并在相应的层次引入任务特定的视觉语言交互，实验表明我们的方法在减轻背景噪声方面具有很好的效果。