Supervised or weakly supervised methods for phrase localization (textual
grounding) rely either on human annotations or on other supervised models,
e.g., object detectors. Obtaining these annotations is labor-intensive and may
be difficult to scale in practice. We propose to leverage