Despite the evolution of deep-learning-based visual-textual processing
systems, precise multi-modal matching remains a challenging task. In this work,
we tackle the task of cross-modal retrieval through image-sentence matching
based on word-region alignments, using supervision only at the global
image-sentence level. Specifically, we present a novel approach called
Transformer Encoder Reasoning and Alignment Network (TERAN). TERAN enforces a
fine-grained match between the underlying components of images and sentences,
i.e., image regions and words, respectively, in order to preserve the
informative richness of both modalities. TERAN obtains state-of-the-art results
on the image retrieval task on both the MS-COCO and Flickr30k datasets. Moreover,
on MS-COCO, it also outperforms current approaches on the sentence retrieval
task.
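
To make the idea of word-region alignment concrete, the following is a minimal sketch, not TERAN's actual implementation. It assumes each image is a list of region feature vectors and each sentence a list of word feature vectors, builds a word-by-region cosine similarity matrix, and pools it into one global score. The function names and the pooling rule (maximum over regions, summed over words) are assumptions chosen for illustration; the abstract does not specify them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def alignment_matrix(regions, words):
    """Return A where A[i][j] is the similarity between word i and region j."""
    return [[cosine(w, r) for r in regions] for w in words]


def global_similarity(regions, words):
    """Pool the word-region matrix into a single image-sentence score.

    Assumed pooling: each word keeps its best-matching region,
    and the per-word scores are summed.
    """
    matrix = alignment_matrix(regions, words)
    return sum(max(row) for row in matrix)


# Toy features: two regions and two words in a shared 2-D space.
regions = [[1.0, 0.0], [0.0, 1.0]]
words = [[1.0, 0.0], [0.0, 2.0]]
score = global_similarity(regions, words)
```

Because only the global score enters the loss, supervision at the image-sentence level is enough to train such a model; the word-region matrix is learned implicitly.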
Focusing on scalable cross-modal information retrieval, TERAN is designed to
keep the visual and textual data pipelines well separated. Cross-attention
links between the two pipelines would make it impossible to extract separately
the visual and textual features needed for the online search and offline
indexing steps of large-scale retrieval systems. For this reason, TERAN merges the information from the two
domains only during the final alignment phase, immediately before the loss
computation. We argue that the fine-grained alignments produced by TERAN pave
the way for research on effective and efficient methods for
large-scale cross-modal information retrieval. We compare the effectiveness of
our approach against relevant state-of-the-art methods. On the MS-COCO 1K test
set, we obtain improvements of 5.7% and 3.5% in Recall@1 on the image and
sentence retrieval tasks, respectively. The code used for the
experiments is publicly available on GitHub.
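
For readers unfamiliar with the Recall@1 metric, the sketch below shows one common way to compute Recall@K from a query-by-candidate similarity matrix. It is an illustrative assumption, not the evaluation script used in the paper: the function name is hypothetical, and it simplifies by assuming exactly one correct candidate per query (MS-COCO images have several captions each, which a full evaluation would handle).

```python
def recall_at_k(similarities, ground_truth, k):
    """Fraction of queries whose correct candidate appears in the top k.

    similarities[q][c] is the score between query q and candidate c;
    ground_truth[q] is the index of the single correct candidate for q.
    """
    hits = 0
    for q, scores in enumerate(similarities):
        # Rank candidate indices from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if ground_truth[q] in ranked[:k]:
            hits += 1
    return hits / len(similarities)


# Toy example: three queries scored against three candidates.
similarities = [
    [0.9, 0.1, 0.2],  # correct candidate 0 is ranked first
    [0.3, 0.2, 0.8],  # correct candidate 1 is ranked last
    [0.1, 0.4, 0.7],  # correct candidate 2 is ranked first
]
ground_truth = [0, 1, 2]
r_at_1 = recall_at_k(similarities, ground_truth, 1)
r_at_3 = recall_at_k(similarities, ground_truth, 3)
```

When the visual and textual pipelines are kept separate, the candidate features behind such a matrix can be computed once offline, and only the query features are computed at search time.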

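This paper performs image-sentence matching through word-region alignment. It proposes a new method called TERAN, which enforces a fine-grained match between the components of images and sentences to achieve cross-modal retrieval, and obtains state-of-the-art results on the MS-COCO and Flickr30k datasets.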

Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using Transformer Encoders

This paper studies the task of matching images and sentences, where learning
appropriate representations across the multi-modal data appears to be the main
challenge. Unlike previous approaches that predominantly deploy a symmetrical
architecture to represent both modalities, we propose the Saliency-guided Attention
Network (SAN), which asymmetrically employs visual and textual attention modules
to learn the fine-grained correlation between vision and language.
The proposed SAN comprises three main components: a saliency detector, a
Saliency-weighted Visual Attention (SVA) module, and a Saliency-guided Textual
Attention (STA) module. Concretely, the saliency detector provides the visual
saliency information as the guidance for the two attention modules. SVA is
designed to leverage the advantage of the saliency information to improve
discrimination of visual representations. By fusing the visual information from
SVA and textual information as a multi-modal guidance, STA learns
discriminative textual representations that are highly sensitive to visual
clues. Extensive experiments demonstrate that SAN improves on the
state-of-the-art results on the Flickr30K and MSCOCO benchmark datasets by a
large margin.
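
As a rough illustration of how saliency information could weight visual features, the sketch below pools region features using softmax-normalized saliency scores. This is an assumed simplification, not the SVA module itself: the abstract does not describe its internals, and the function names and the softmax weighting are hypothetical choices made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def saliency_weighted_pool(features, saliency):
    """Weighted average of region features, with weights from saliency scores.

    features: list of equal-length region feature vectors.
    saliency: one raw saliency score per region (assumed input).
    """
    weights = softmax(saliency)
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]


# Two regions with equal saliency contribute equally to the pooled vector.
features = [[1.0, 0.0], [0.0, 1.0]]
pooled_equal = saliency_weighted_pool(features, [0.0, 0.0])
# A much more salient first region dominates the pooled vector.
pooled_skewed = saliency_weighted_pool(features, [10.0, -10.0])
```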

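This study addresses the problem of matching images and sentences. It proposes the Saliency-guided Attention Network architecture, which includes visual and textual attention modules, effectively improves the accuracy of multi-modal data representations, and achieves large gains on the Flickr30K and MSCOCO datasets.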