Supervised and weakly supervised methods for phrase localization (textual
grounding) rely on either human annotations or other supervised models,
e.g., object detectors. Obtaining these annotations is labor-intensive and may
be difficult to scale in practice. We propose to leverage recent advances in
contrastive language-vision models, specifically CLIP, which is pre-trained on
image-caption pairs collected from the internet. In its original form, CLIP only outputs an
image-level embedding without any spatial resolution. We adapt CLIP to generate
high-resolution spatial feature maps. Importantly, we can extract feature maps
from both ViT and ResNet CLIP models while maintaining the semantic properties
of an image embedding. This provides a natural framework for phrase
localization. Our method for phrase localization requires no human annotations
or additional training. Extensive experiments show that our method outperforms
existing no-training methods in zero-shot phrase localization, and in some
cases, it even outperforms supervised methods. Code is available at
this https URL.
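
To illustrate how a spatial feature map that shares CLIP's embedding space can be used for phrase localization, the sketch below scores every spatial cell against a phrase embedding and boxes the high-scoring region. It is a minimal illustration under stated assumptions, not the procedure of this paper: the function name `localize_phrase`, the min-max normalization, and the fixed `threshold` are all hypothetical choices, and the arrays stand in for features that would come from an adapted CLIP image encoder and the CLIP text encoder.

```python
import numpy as np


def localize_phrase(feature_map, text_embedding, threshold=0.5):
    """Box the region of a spatial feature map that best matches a phrase.

    feature_map:    (H, W, D) array, assumed to live in the same embedding
                    space as the text embedding (hypothetical input).
    text_embedding: (D,) array for the query phrase (hypothetical input).
    threshold:      cutoff on the normalized heatmap; an illustrative
                    assumption, not a value taken from the paper.

    Returns ((x_min, y_min, x_max, y_max), heatmap), with box coordinates
    given in feature-map cells (inclusive).
    """
    eps = 1e-8
    # L2-normalize so the dot product below is a cosine similarity.
    feats = feature_map / (np.linalg.norm(feature_map, axis=-1, keepdims=True) + eps)
    text = text_embedding / (np.linalg.norm(text_embedding) + eps)

    # Cosine similarity between each spatial cell and the phrase: shape (H, W).
    sim = feats @ text

    # Min-max normalize to [0, 1] so the threshold is scale-independent.
    heat = (sim - sim.min()) / (sim.max() - sim.min() + eps)

    # Keep cells above the threshold and take their tight bounding box.
    ys, xs = np.nonzero(heat >= threshold)
    if ys.size == 0:
        # Degenerate case (e.g. a constant map): fall back to the full map.
        h, w = heat.shape
        return (0, 0, w - 1, h - 1), heat
    box = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
    return box, heat
```

The returned box is expressed in feature-map cells, so it would still need to be rescaled by the ratio of image size to feature-map size to obtain pixel coordinates.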

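Using the contrastive language-vision model CLIP, we obtain a phrase localization method that requires no human annotations or additional training. Its zero-shot phrase localization performance exceeds that of existing no-training methods and, in some cases, even surpasses supervised methods.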