Visual grounding, a crucial vision-language task involving the understanding of the visual context based on the query expression, necessitates the model to capture the interactions between objects, as well as various spatial and attribute information. However, the annotation data of visual grounding task is limited due to its time-consuming and labor-intensive annotation process, resulting in the trained models being constrained from generalizing its capability to a broader domain. To address this challenge, we propose GroundVLP, a simple yet effective zero-shot method that harnesses visual grounding ability from the existing models trained from image-text pairs and pure object detection data, both of which are more conveniently obtainable and offer a broader domain compared to visual grounding annotation data. GroundVLP proposes a fusion mechanism that combines the heatmap from GradCAM and the object proposals of open-vocabulary detectors. We demonstrate that the proposed method significantly outperforms other zero-shot methods on RefCOCO/+/g datasets, surpassing prior zero-shot state-of-the-art by approximately 28\% on the test split of RefCOCO and RefCOCO+. Furthermore, GroundVLP performs comparably to or even better than some non-VLP-based supervised models on the Flickr30k entities dataset. Our code is available at https://github.com/om-ai-lab/GroundVLP.

通过现有的图像-文本配对模型和纯物体检测数据，我们提出了一种名为GroundVLP的简单而有效的零样本方法，该方法结合了GradCAM热力图和开放词汇检测器的对象提案，用于捕捉视觉环境并解决视觉定位任务中数据标注不足的挑战，实验结果显示该方法在RefCOCO/+/g数据集上超过了现有零样本方法的28％，并且在Flickr30k实体数据集上与一些非VLP的有监督模型表现相当甚至更好。

GroundVLP：从视觉语言预训练和开放词汇对象检测中利用零样本视觉定位