We study the visual semantic embedding problem for image-text matching. Most
existing work utilizes a tailored cross-attention mechanism to perform local
alignment across the image and text modalities. Cross-attention is more
powerful than the unimodal dual-encoder approach, but it is computationally
expensive. This work introduces a dual-encoder image-text matching model,
leveraging a scene graph to represent captions with nodes for objects and
attributes interconnected by relational edges. Utilizing a graph attention
network, our model efficiently encodes object-attribute and object-object
semantic relations, resulting in a robust and fast-performing system.
Representing the caption as a scene graph lets the model exploit the strong
relational inductive bias of graph neural networks to learn object-attribute
and object-object relations effectively. To train the model, we propose losses
that align the image and caption both at the holistic level (image-caption) and
the local level (image-object entity), which we show is key to the success of
the model. We call our model CORA (Composition model for Object Relations and
Attributes). Experimental results on two prominent image-text retrieval
benchmarks, Flickr30K and MSCOCO, demonstrate that CORA outperforms existing
state-of-the-art cross-attention methods, which are computationally expensive,
in recall while retaining the fast computation speed of a dual encoder.
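
The caption encoder described above can be illustrated with a minimal sketch:
a caption is stored as a scene graph, and one single-head graph attention
layer (in the style of GAT) mixes information along its edges. Everything
concrete here is an assumption made for illustration: the example caption, the
edge directions, the feature sizes, the random weights, and the choice to
average the object nodes into one caption embedding. It is not CORA's actual
architecture.

```python
import numpy as np

# Hypothetical scene graph for the caption "a brown dog chases a small cat".
nodes = ["dog", "brown", "cat", "small"]   # object and attribute nodes
edges = [
    (1, 0),  # attribute "brown" -> object "dog"
    (3, 2),  # attribute "small" -> object "cat"
    (0, 2),  # relation "chases": object "dog" -> object "cat"
]


def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)


def graph_attention_layer(h, edges, W, a):
    """Single-head graph attention over directed edges.

    h: (N, d_in) node features, W: (d_in, d_out) projection,
    a: (2 * d_out,) attention vector.
    Each node attends to itself and to the nodes with an edge pointing at it.
    """
    n = h.shape[0]
    z = h @ W                                   # projected features, (N, d_out)
    d_out = z.shape[1]

    # adj[i, j] is True when node j sends a message to node i.
    adj = np.eye(n, dtype=bool)                 # self-loops
    for src, dst in edges:
        adj[dst, src] = True

    # scores[i, j] = LeakyReLU(a^T [z_i || z_j])
    scores = leaky_relu((z @ a[:d_out])[:, None] + (z @ a[d_out:])[None, :])
    scores = np.where(adj, scores, -np.inf)     # mask non-neighbours

    # Row-wise softmax over the allowed neighbours.
    scores = scores - scores.max(axis=1, keepdims=True)
    alpha = np.exp(scores)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return alpha @ z, alpha


rng = np.random.default_rng(0)
h = rng.normal(size=(len(nodes), 8))            # stand-in for word embeddings
W = rng.normal(size=(8, 4))
a = rng.normal(size=(8,))

out, alpha = graph_attention_layer(h, edges, W, a)
# Assumed readout: average the object nodes ("dog", "cat") into one vector.
caption_embedding = out[[0, 2]].mean(axis=0)
```

After the layer, the "cat" node has absorbed information from its attribute
"small" and from the related object "dog", which is the kind of
object-attribute and object-object composition the abstract refers to.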

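In this work, we build a dual-encoder image-text matching model that represents
the image caption as a scene graph and encodes it with a graph attention
network. The model efficiently encodes object-attribute and object-object
semantic relations by exploiting the strong relational inductive bias of graph
neural networks. Experiments on two prominent image-text retrieval benchmarks,
Flickr30K and MSCOCO, show that CORA outperforms computationally expensive
cross-attention methods in recall while retaining the fast computation speed of
a dual encoder.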
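
To make the recall metric and the dual-encoder speed argument concrete, here is
a minimal sketch of image-to-text Recall@K. It assumes the image and caption
embeddings were already produced independently by the two encoders, and that
caption i is the single ground-truth match for image i. That one-to-one pairing
is a simplification chosen for this example, not the benchmarks' protocol. The
function name `recall_at_k` is hypothetical.

```python
import numpy as np


def recall_at_k(image_emb, text_emb, k):
    """Fraction of images whose matching caption is ranked in the top k.

    image_emb: (n, d) image embeddings, text_emb: (n, d) caption embeddings.
    Assumes caption i is the only correct match for image i.
    """
    # L2-normalise so the dot product equals cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # With a dual encoder, all pairwise scores come from one matrix product.
    sim = image_emb @ text_emb.T                # (n_images, n_captions)

    # Rank of the true caption = number of captions scoring strictly higher.
    gt = np.diag(sim)[:, None]
    ranks = (sim > gt).sum(axis=1)
    return float((ranks < k).mean())


# Toy data: image 0 retrieves its own caption first, images 1 and 2 do not.
images = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
captions = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
r1 = recall_at_k(images, captions, k=1)
r2 = recall_at_k(images, captions, k=2)
```

The single matrix product is where the dual encoder saves time: each image and
caption is encoded once, whereas a cross-attention model has to run a joint
forward pass for every image-caption pair.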

Composing Object Relations and Attributes for Image-Text Matching

Visual Semantic Embedding (VSE) is a dominant approach for vision-language
retrieval, which aims at learning a deep embedding space such that visual data
are embedded close to their semantic text labels or descriptions. Recent VSE
models use complex methods to better contextualize and aggregate multi-modal
features into holistic embeddings. However, we discover that surprisingly
simple (but carefully selected) global pooling functions (e.g., max pooling)
outperform those complex models across different feature extractors. Although
these pooling functions are simple and effective, finding the best one for each
data modality and feature extractor is costly and tedious, especially when the
size of the features varies (e.g., text, video). Therefore, we propose a
Generalized Pooling Operator (GPO), which learns to automatically adapt itself
to the best pooling strategy for different features, requiring no manual tuning
while staying effective and efficient. We extend the VSE model using this
proposed GPO and denote it as VSE$\infty$.
Without bells and whistles, VSE$\infty$ outperforms previous VSE methods
significantly on image-text retrieval benchmarks across popular feature
extractors. With a simple adaptation, variants of VSE$\infty$ further
demonstrate their strength by achieving a new state of the art on two
video-text retrieval datasets. Comprehensive experiments and visualizations
confirm that GPO always discovers the best pooling strategy and can be a
plug-and-play feature aggregation module for standard VSE models. Code and
pre-trained models are available at this https URL
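
One way to see how a single operator can cover several pooling strategies is
the sketch below. It assumes the operator is a weighted sum over the values of
each feature dimension after sorting them in descending order. The abstract
does not spell out the mechanism, so this parameterization, the function names,
and the hand-set weights are illustrative assumptions. In a learned operator
the weights would be produced by a trained module, which is left out here.

```python
import numpy as np


def sorted_weighted_pool(features, weights):
    """Pool a set of feature vectors into one vector.

    features: (n, d) local features (e.g. region or word features).
    weights:  (n,) coefficients applied to each dimension's values after
              sorting them in descending order.
    """
    sorted_feats = -np.sort(-features, axis=0)  # descending along the set axis
    return weights @ sorted_feats               # (d,)


def max_weights(n):
    # All weight on the largest value: max pooling.
    w = np.zeros(n)
    w[0] = 1.0
    return w


def mean_weights(n):
    # Equal weight on every position: average pooling.
    return np.full(n, 1.0 / n)


def k_max_weights(n, k):
    # Equal weight on the k largest values: k-max pooling.
    w = np.zeros(n)
    w[:k] = 1.0 / k
    return w


rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 3))                 # 5 local features, 3 dimensions

pooled_max = sorted_weighted_pool(feats, max_weights(5))
pooled_mean = sorted_weighted_pool(feats, mean_weights(5))
pooled_top2 = sorted_weighted_pool(feats, k_max_weights(5, 2))
```

Under this assumption, choosing a pooling function reduces to choosing a weight
vector, so a model can search over pooling strategies by learning the weights
instead of trying each strategy by hand.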

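Visual Semantic Embedding models use complex methods to aggregate multi-modal
features into embeddings, but this work finds that simple, carefully chosen
global pooling functions (e.g., max pooling) outperform those complex models.
The authors therefore propose a Generalized Pooling Operator (GPO) that
automatically adapts to the best pooling strategy for different features, and
they use it to extend the VSE model.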