Visual grounding, a crucial vision-language task that involves understanding
the visual context referred to by a query expression, requires the model to
capture the interactions between objects as well as various spatial and
attribute information. However, annotated data for visual grounding is limited
because the annotation process is time-consuming and labor-intensive, which
restricts how well trained models generalize to broader domains. To address
this challenge, we propose GroundVLP, a simple yet effective zero-shot method
that harnesses the visual grounding ability of existing models trained on
image-text pairs and pure object detection data, both of which are easier to
obtain and cover a broader domain than visual grounding annotation data.
GroundVLP introduces a fusion mechanism that combines the heatmap from GradCAM
with the object proposals of open-vocabulary detectors. We demonstrate that the
proposed method significantly outperforms other zero-shot methods on the
RefCOCO/+/g datasets, surpassing the prior zero-shot state of the art by
approximately 28% on the test splits of RefCOCO and RefCOCO+. Furthermore,
GroundVLP performs comparably to or even better than some non-VLP-based
supervised models on the Flickr30k Entities dataset. Our code is available at
this https URL
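
To make the fusion idea concrete, here is a minimal sketch of how a heatmap and a set of box proposals could be combined. The abstract does not spell out the scoring rule, so everything below is an assumption for illustration: the function names `box_score` and `select_box`, the area exponent `alpha`, and the rule itself (sum of heatmap activation inside the box, divided by the box area raised to `alpha`). In a real pipeline the heatmap would come from GradCAM on a vision-language model and the boxes from an open-vocabulary detector; here both are toy values.

```python
from typing import List, Sequence, Tuple

# (x0, y0, x1, y1) in pixel coordinates, with x1 and y1 exclusive.
Box = Tuple[int, int, int, int]


def box_score(heatmap: Sequence[Sequence[float]], box: Box, alpha: float = 0.5) -> float:
    """Sum the heatmap activation inside the box, damped by area ** alpha.

    The damping (an assumed rule) keeps very large boxes from winning
    simply because they cover every activated pixel.
    """
    x0, y0, x1, y1 = box
    area = max((x1 - x0) * (y1 - y0), 1)
    total = sum(heatmap[y][x] for y in range(y0, y1) for x in range(x0, x1))
    return total / (area ** alpha)


def select_box(heatmap: Sequence[Sequence[float]], boxes: List[Box], alpha: float = 0.5) -> int:
    """Return the index of the proposal with the highest fused score."""
    scores = [box_score(heatmap, b, alpha) for b in boxes]
    return max(range(len(boxes)), key=scores.__getitem__)


# Toy 4x6 heatmap with activation concentrated on the right-hand side.
heatmap = [
    [0.0, 0.0, 0.0, 0.1, 0.1, 0.0],
    [0.0, 0.0, 0.0, 0.8, 0.9, 0.0],
    [0.0, 0.1, 0.0, 0.9, 0.8, 0.0],
    [0.0, 0.0, 0.0, 0.1, 0.0, 0.0],
]
# Hypothetical detector proposals: left region, tight box on the hot spot, whole image.
proposals = [(0, 0, 3, 4), (3, 1, 5, 3), (0, 0, 6, 4)]
best = select_box(heatmap, proposals)
```

With `alpha=0.5` the tight box on the hot spot is selected. Setting `alpha=0.0` removes the area damping, and the whole-image box wins because it contains all of the activation, which is why some form of size normalization is a natural design choice in this kind of fusion.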

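Building on existing models trained on image-text pairs and pure object detection data, this paper proposes GroundVLP, a simple yet effective zero-shot method that combines GradCAM heatmaps with the object proposals of open-vocabulary detectors to capture visual context and address the shortage of annotated data for visual grounding. Experiments show that it outperforms other zero-shot methods on the RefCOCO/+/g datasets, surpassing the prior zero-shot state of the art by approximately 28% on the test splits of RefCOCO and RefCOCO+, and that it performs comparably to or even better than some non-VLP-based supervised models on the Flickr30k Entities dataset.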

GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection

Vision-Language Models (VLMs), such as CLIP, have demonstrated impressive
zero-shot transfer capabilities in image-level visual perception. However,
these models have shown limited performance in instance-level tasks that demand
precise localization and recognition. Previous works have suggested that
incorporating visual prompts, such as colorful boxes or circles, can improve
the ability of models to recognize objects of interest. Nonetheless, compared
to language prompting, visual prompting designs are rarely explored. Existing
approaches, which employ coarse visual cues such as colorful boxes or circles,
often result in sub-optimal performance due to the inclusion of irrelevant and
noisy pixels. In this paper, we carefully study the visual prompting designs by
exploring more fine-grained markings, such as segmentation masks and their
variations. In addition, we introduce a new zero-shot framework that leverages
pixel-level annotations acquired from a generalist segmentation model for
fine-grained visual prompting. Our investigation reveals that a
straightforward application of blur outside the target mask, referred to as the
Blur Reverse Mask, is exceptionally effective. This proposed prompting
strategy leverages the precise mask annotations to reduce focus on weakly
related regions while retaining spatial coherence between the target and the
surrounding background. Our Fine-Grained Visual Prompting (FGVP) demonstrates
superior performance in zero-shot comprehension of referring expressions on the
RefCOCO, RefCOCO+, and RefCOCOg benchmarks. It outperforms prior methods by an
average margin of 3.0% to 4.6%, with a maximum improvement of 12.5% on the
RefCOCO+ testA subset. The part detection experiments conducted on the PACO
dataset further validate the superiority of FGVP over existing visual
prompting techniques. Code and models will be made available.
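
The Blur Reverse Mask described above (blur everything outside the target mask, keep the target sharp) can be sketched in a few lines. This is only an illustration under stated assumptions: the image is a small grayscale grid stored as nested lists, the blur is a simple box blur chosen here for brevity (the abstract does not say which blur kernel is used), and the names `box_blur` and `blur_reverse_mask` are hypothetical. In the described framework the mask would come from a generalist segmentation model; here it is written by hand.

```python
from typing import List

Image = List[List[float]]  # grayscale image, row-major
Mask = List[List[bool]]    # True for pixels inside the target


def box_blur(img: Image, radius: int = 1) -> Image:
    """Replace each pixel by the mean of its (2 * radius + 1)^2 neighborhood,
    clipping the window at the image border."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            vals = [img[j][i] for j in ys for i in xs]
            out[y][x] = sum(vals) / len(vals)
    return out


def blur_reverse_mask(img: Image, mask: Mask, radius: int = 1) -> Image:
    """Keep original pixels inside the mask and blurred pixels outside it."""
    blurred = box_blur(img, radius)
    h, w = len(img), len(img[0])
    return [
        [img[y][x] if mask[y][x] else blurred[y][x] for x in range(w)]
        for y in range(h)
    ]


# Toy 3x3 image whose center pixel is the target.
image = [
    [0.0, 9.0, 0.0],
    [9.0, 9.0, 9.0],
    [0.0, 9.0, 0.0],
]
target_mask = [
    [False, False, False],
    [False, True, False],
    [False, False, False],
]
prompted = blur_reverse_mask(image, target_mask)
```

The prompted image keeps the spatial layout of the background, only at lower detail, which matches the stated goal of reducing focus on weakly related regions while retaining spatial coherence between the target and its surroundings.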

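This paper introduces a new zero-shot framework, Fine-Grained Visual Prompting (FGVP), which improves visual prompting design by using precise mask annotations and outperforms prior methods across several benchmarks.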

Fine-Grained Visual Prompting

Referring image segmentation aims at localizing all pixels of the visual
objects described by a natural language sentence. Previous works learn to
straightforwardly align the sentence embedding and pixel-level embedding for
highlighting the referred objects, but ignore the semantic consistency of
pixels within the same object, leading to incomplete masks and localization
errors in predictions. To tackle this problem, we propose CoupAlign, a simple
yet effective multi-level visual-semantic alignment method that couples
sentence-mask alignment with word-pixel alignment to enforce an object mask
constraint, yielding more accurate localization and segmentation.
Specifically, the Word-Pixel Alignment (WPA) module performs early fusion of
linguistic and pixel-level features in intermediate layers of the vision and
language encoders. Based on the word-pixel aligned embedding, a set of mask
proposals is generated to hypothesize possible objects. Then, in the
Sentence-Mask Alignment (SMA) module, the masks are weighted by the sentence
embedding to localize the referred object, and finally projected back to
aggregate the pixels for the target. To further enhance the learning of the two
alignment modules, an auxiliary loss is designed to contrast the foreground and
background pixels. By hierarchically aligning pixels and masks with linguistic
features, our CoupAlign captures the pixel coherence at both visual and
semantic levels, thus generating more accurate predictions. Extensive
experiments on popular datasets (e.g., RefCOCO and G-Ref) show that our method
achieves consistent improvements over state-of-the-art methods, e.g., an
increase of about 2% oIoU on the validation and test sets of RefCOCO. In
particular, CoupAlign shows a remarkable ability to distinguish the target from
multiple objects of the same class.
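
The Sentence-Mask Alignment step described above (weight the mask proposals by the sentence embedding, then project back to pixels) can be pictured with a small sketch. The details are assumptions made for illustration and are not taken from the paper: each proposal is given its own embedding vector, the weights are a softmax over dot products with the sentence embedding, and the function name `sentence_mask_alignment` is hypothetical. Learned encoders are replaced by hand-written toy vectors.

```python
import math
from typing import List, Tuple

Vec = List[float]
SoftMask = List[List[float]]  # per-pixel membership in [0, 1]


def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sentence_mask_alignment(
    masks: List[SoftMask], mask_embs: List[Vec], sentence_emb: Vec
) -> Tuple[List[float], SoftMask]:
    """Weight each mask proposal by its similarity to the sentence embedding,
    then sum the weighted masks to obtain a per-pixel score map."""
    weights = softmax([dot(e, sentence_emb) for e in mask_embs])
    h, w = len(masks[0]), len(masks[0][0])
    fused = [
        [sum(wt * m[y][x] for wt, m in zip(weights, masks)) for x in range(w)]
        for y in range(h)
    ]
    return weights, fused


# Two toy proposals on a 1x4 image: one covers the left half, one the right half.
masks = [
    [[1.0, 1.0, 0.0, 0.0]],
    [[0.0, 0.0, 1.0, 1.0]],
]
mask_embs = [[1.0, 0.0], [0.0, 1.0]]
sentence_emb = [0.0, 4.0]  # toy sentence that matches the second proposal
weights, fused = sentence_mask_alignment(masks, mask_embs, sentence_emb)
```

Because pixels are scored through the mask they belong to, all pixels of the selected proposal receive the same high weight, which is one way to picture the object mask constraint that the abstract contrasts with purely pixel-level alignment.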

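This paper proposes CoupAlign, a multi-level visual-semantic alignment method that couples word-pixel alignment with sentence-mask alignment to achieve more accurate pixel localization and segmentation; on the RefCOCO and G-Ref datasets it is able to distinguish the target from multiple objects of the same class.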

CoupAlign: Coupling Word-Pixel with Sentence-Mask Alignments for Referring Image Segmentation

We improve one-stage visual grounding by addressing current limitations on
grounding long and complex queries. Existing one-stage methods encode the
entire language query as a single sentence embedding vector, e.g., taking the
embedding from BERT or the hidden state from LSTM. This single vector
representation is prone to overlooking the detailed descriptions in the query.
To address this query modeling deficiency, we propose a recursive sub-query
construction framework, which reasons between image and query for multiple
rounds and reduces the referring ambiguity step by step. We show that our new
one-stage method obtains 5.0%, 4.5%, 7.5%, and 12.8% absolute improvements over
the state-of-the-art one-stage baseline on ReferItGame, RefCOCO, RefCOCO+, and
RefCOCOg, respectively. In particular, superior performance on longer and more
complex queries validates the effectiveness of our query modeling.
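
The multi-round reasoning described above can be illustrated with a toy loop. This is a loose sketch in the spirit of recursive sub-query construction, not the paper's model: the learned modules are replaced by dot products, and the word-attention rule, the history discount that steers later rounds toward words not yet used, the additive score update, and the name `recursive_grounding` are all assumptions made for illustration.

```python
import math
from typing import List, Tuple

Vec = List[float]


def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights: List[float], vecs: List[Vec]) -> Vec:
    dim = len(vecs[0])
    return [sum(w * v[d] for w, v in zip(weights, vecs)) for d in range(dim)]


def recursive_grounding(
    word_embs: List[Vec], region_feats: List[Vec], rounds: int = 3
) -> Tuple[int, List[float], List[List[float]]]:
    """Refine region scores over several rounds, one sub-query per round."""
    history = [0.0] * len(word_embs)   # cumulative attention each word has received
    scores = [0.0] * len(region_feats)
    trace = []                         # word attention used in each round
    for _ in range(rounds):
        # Summarize the current belief over regions as a visual context vector.
        context = weighted_sum(softmax(scores), region_feats)
        # Attend to words relevant to the context, discounting words already used.
        logits = [
            dot(e, context) + math.log(max(1.0 - h, 1e-6))
            for e, h in zip(word_embs, history)
        ]
        attn = softmax(logits)
        sub_query = weighted_sum(attn, word_embs)
        # Accumulate evidence for each region from this round's sub-query.
        scores = [s + dot(f, sub_query) for s, f in zip(scores, region_feats)]
        history = [min(1.0, h + a) for h, a in zip(history, attn)]
        trace.append(attn)
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores, trace


# Toy query "red ... left": dimension 0 encodes "red", dimension 1 encodes "left".
word_embs = [[1.0, 0.0], [0.0, 1.0]]
# Region 0 is red and on the left; region 1 is only red; region 2 is only on the left.
region_feats = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
best, scores, trace = recursive_grounding(word_embs, region_feats)
```

In this toy setting the region that satisfies both parts of the query pulls further ahead of the partial matches with every round, which is the intuition behind reducing referring ambiguity step by step rather than relying on a single sentence vector.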

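This paper proposes a recursive sub-query construction framework that addresses current limitations of one-stage visual grounding and improves accuracy on long and complex queries, yielding significant gains over the existing one-stage baseline on multiple benchmark datasets.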