This paper presents an approach for grounding phrases in images that jointly learns multiple text-conditioned embeddings in a single end-to-end model. To separate text phrases into semantically distinct subspaces, we propose a concept weight branch that automatically assigns phrases to embeddings, whereas prior works predefine such assignments. Our proposed solution simplifies the representation requirements for individual embeddings and allows underrepresented concepts to take advantage of the shared representations before they are fed into concept-specific layers. Comprehensive experiments verify the effectiveness of our approach across three phrase grounding datasets: Flickr30K Entities, ReferIt Game, and Visual Genome, where we obtain improvements of 3.5%, 2%, and 3.5%, respectively, in grounding performance over a strong region-phrase embedding baseline.

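This paper proposes a method for grounding phrases in images, built on multiple text-conditioned embeddings within a single end-to-end model. To separate text phrases into semantically distinct subspaces, we propose a concept weight branch that automatically assigns phrases to embeddings, rather than predefining these assignments as previous methods do. Our method simplifies the representation requirements of individual embeddings and allows underrepresented concepts to make full use of shared representations before they are passed to concept-specific layers. Comprehensive experiments on three phrase grounding datasets verify the effectiveness of our method, yielding improvements of 3.5%, 2%, and 3.5%, respectively, over a strong region-phrase embedding baseline.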
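
The data flow described above (shared representations, then concept-specific layers, combined by a concept weight branch) can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `ConditionalEmbeddingSketch`, the layer shapes, the element-wise product used to fuse region and phrase features, and the random weights standing in for learned parameters are all assumptions made for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Random weights standing in for learned parameters (assumption)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class ConditionalEmbeddingSketch:
    """Hypothetical toy model: shared layers, K concept layers, weight branch."""

    def __init__(self, region_dim, phrase_dim, shared_dim, num_concepts, seed=0):
        rng = random.Random(seed)
        # Shared projections used by every concept.
        self.shared_region = random_matrix(shared_dim, region_dim, rng)
        self.shared_phrase = random_matrix(shared_dim, phrase_dim, rng)
        # One concept-specific scoring layer per row (simplified to a vector).
        self.concept_layers = random_matrix(num_concepts, shared_dim, rng)
        # Concept weight branch: phrase features -> soft assignment over concepts.
        self.concept_branch = random_matrix(num_concepts, phrase_dim, rng)

    def concept_weights(self, phrase):
        """Soft assignment of a phrase to the concept embeddings."""
        return softmax(matvec(self.concept_branch, phrase))

    def score_region(self, region, phrase):
        """Concept-weighted compatibility score for one region-phrase pair."""
        r = [max(0.0, v) for v in matvec(self.shared_region, region)]
        p = [max(0.0, v) for v in matvec(self.shared_phrase, phrase)]
        joint = [a * b for a, b in zip(r, p)]  # assumed fusion: element-wise product
        per_concept = matvec(self.concept_layers, joint)
        weights = self.concept_weights(phrase)
        return sum(w * s for w, s in zip(weights, per_concept))

    def ground(self, regions, phrase):
        """Return the index of the highest-scoring candidate region."""
        scores = [self.score_region(region, phrase) for region in regions]
        return max(range(len(regions)), key=scores.__getitem__)


if __name__ == "__main__":
    model = ConditionalEmbeddingSketch(region_dim=4, phrase_dim=3,
                                       shared_dim=5, num_concepts=2)
    phrase = [0.2, -0.1, 0.7]
    regions = [[0.1, 0.3, -0.2, 0.5], [0.9, -0.4, 0.0, 0.2]]
    print("concept weights:", model.concept_weights(phrase))
    print("best region index:", model.ground(regions, phrase))
```

Because the concept weights come from a softmax over the phrase features, the assignment of phrases to embeddings is soft and could be trained jointly with the other layers, in contrast to a fixed, predefined mapping from phrases to embeddings.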

Conditional Image-Text Embedding Networks