Image-text retrieval is one of the major tasks of cross-modal retrieval. Several approaches to this task map images and texts into a common space to establish correspondences between the two modalities. However, because an image is semantically rich, its redundant secondary information may cause false matches. To address this issue, this paper presents a semantic optimization approach, implemented as a Visual Semantic Loss (VSL), that helps the model focus on an image's main content. The approach is inspired by how people typically annotate an image: by describing its main content. We therefore leverage the annotated texts corresponding to an image to help the model capture the image's main content and reduce the negative impact of secondary content. Extensive experiments on two benchmark datasets (MSCOCO and Flickr30K) demonstrate the superior performance of our method. The code is available at: https://github.com/ZhangXu0963/VSL.
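To make the common-space idea concrete, the following is a minimal sketch of retrieval by cosine similarity between embeddings that are assumed to already live in a shared space. It is not the VSL method or the authors' implementation: the function names (`l2_normalize`, `cosine_similarity`, `rank_texts`) and the toy three-dimensional embeddings are hypothetical and chosen only for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def rank_texts(image_emb, text_embs):
    """Return text indices sorted from most to least similar to the image."""
    scores = [cosine_similarity(image_emb, t) for t in text_embs]
    return sorted(range(len(text_embs)), key=lambda i: scores[i], reverse=True)


# Toy embeddings (assumed values, not taken from the paper).
# The image mostly expresses its main content (axis 0) plus a little
# secondary content (axis 1).
image_emb = [0.9, 0.1, 0.0]
text_embs = [
    [0.0, 1.0, 0.0],  # caption describing only the secondary content
    [1.0, 0.0, 0.0],  # caption describing the main content
    [0.0, 0.0, 1.0],  # unrelated caption
]

ranking = rank_texts(image_emb, text_embs)
print(ranking)
```

In this toy setting the caption about the main content ranks first. If the secondary component of the image embedding were larger, the caption about secondary content could outrank it, which is the kind of false match the abstract describes.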

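This paper proposes a semantic optimization approach, called Visual Semantic Loss (VSL), that helps the model focus on the main content of an image. By leveraging the annotated texts of an image, it reduces the negative impact of secondary content. Extensive experiments on two benchmark datasets (MSCOCO and Flickr30K) demonstrate the superior performance of the method.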

Image-Text Retrieval via Preserving Main Semantics of Vision