There is now an abundance of data pairing images with surrounding free-form text that corresponds only weakly to those images. Weakly supervised phrase-grounding (WSG) is the task of using this data to learn to localize (or to ground) arbitrary text phrases in images without any