Grounding (i.e., localizing) arbitrary, free-form textual phrases in visual content is a challenging problem with many applications in human-computer interaction and image-text reference resolution. Although many data sources contain images described with sentences or phrases, they typically do not provide the spatial localization of those phrases. This is true both for curated datasets such as MSCOCO and for large collections of user-generated content such as the YFCC 100M dataset. Consequently, being able to learn from this data without grounding supervision would make a large amount and variety of training data available. For this setting we propose GroundeR, a novel approach that learns grounding by reconstructing a given phrase using an attention mechanism. More specifically, during training the model encodes the phrase using an LSTM and then has to learn to attend to the relevant image region in order to reconstruct the input phrase. At test time, the attention itself, i.e., the grounding, is evaluated. On the Flickr 30k Entities dataset our approach outperforms prior work which, in contrast to ours, trains with grounding (bounding box) annotations.
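
The attention step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the phrase vector stands in for the LSTM encoding, the region features are toy values, the dot-product scoring is an assumption made for brevity, and the LSTM decoder that reconstructs the phrase from the attended feature is omitted. The names `attend` and `ground` are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into attention weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(phrase_vec, region_feats):
    """Soft attention over candidate regions (hypothetical helper).

    Scoring is a plain dot product between the phrase encoding and each
    region feature; this is an assumption, not the scoring from the paper.
    """
    scores = [sum(p * r for p, r in zip(phrase_vec, feat))
              for feat in region_feats]
    weights = softmax(scores)
    dim = len(region_feats[0])
    # Weighted sum of region features. During training, a decoder would
    # have to reconstruct the phrase from this vector (not shown here).
    attended = [sum(w * feat[d] for w, feat in zip(weights, region_feats))
                for d in range(dim)]
    return weights, attended


def ground(phrase_vec, region_feats):
    """At test time, take the region with the highest attention weight."""
    weights, _ = attend(phrase_vec, region_feats)
    return max(range(len(weights)), key=weights.__getitem__)


# Toy data: one phrase encoding and three candidate region features.
phrase_vec = [1.0, 0.0]
region_feats = [[0.1, 0.9], [2.0, 0.1], [0.5, 0.5]]

weights, attended = attend(phrase_vec, region_feats)
best = ground(phrase_vec, region_feats)
print("attention weights:", weights)
print("selected region index:", best)
```

The point of the sketch is that no bounding-box label appears anywhere: the only training signal would come from reconstructing the phrase, and the grounding is read off the attention weights afterwards.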

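By using an attention mechanism to reconstruct a given phrase, this paper proposes a new, nearly unsupervised method for learning grounding. The method needs little ground-truth supervision and improves results on the Flickr 30k Entities dataset.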

Grounding of textual phrases in images by reconstruction