Weakly supervised phrase grounding aims at learning region-phrase correspondences using only image-sentence pairs. A major challenge thus lies in the missing links between image regions and sentence phrases during training. To address this challenge, we leverage a generic object detector at training time, and propose a contrastive learning framework that accounts for both region-phrase and image-sentence matching. Our core innovation is the learning of a region-phrase score function, based on which an image-sentence score function is further constructed. Importantly, our region-phrase score function is learned by distilling from soft matching scores between the detected object class names and candidate phrases within an image-sentence pair, while the image-sentence score function is supervised by ground-truth image-sentence pairs. The design of such score functions removes the need of object detection at test time, thereby significantly reducing the inference cost. Without bells and whistles, our approach achieves state-of-the-art results on the task of visual phrase grounding, surpassing previous methods that require expensive object detectors at test time.

本文提出了一种基于弱监督学习、对比学习和目标检测的短语定位方法，旨在解决训练过程中图像区域和句子短语之间缺失链接的问题，并在消除了测试时的目标检测需求后，取得了在视觉短语定位方面的最新成果。

通过对比知识蒸馏提高弱监督视觉定位