Image-text retrieval, whose core task is to measure the similarity between images and text, is a widely studied topic in computer vision, driven by the exponential growth of multimedia data. However, most existing retrieval methods rely heavily on cross-attention mechanisms for fine-grained cross-modal alignment. These mechanisms take excessive irrelevant regions into account and treat prominent and non-significant words equally, which limits retrieval accuracy. This paper investigates an alignment approach that reduces the involvement of non-significant fragments in images and text while enhancing the alignment of prominent fragments. To this end, we introduce the Cross-Modal Prominent Fragments Enhancement Aligning Network (CPFEAN), which improves retrieval accuracy by diminishing the participation of irrelevant regions during alignment and relatively increasing the alignment similarity of prominent words. Additionally, we incorporate prior textual information into image regions to reduce misalignment. In practice, we first design a novel intra-modal fragment relationship reasoning method, and then employ our proposed alignment mechanism to compute the similarity between images and text. Extensive quantitative comparative experiments on the MS-COCO and Flickr30K datasets demonstrate that our approach outperforms state-of-the-art methods by about 5% to 10% in the rSum metric.

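By reducing the participation of non-significant image and text fragments and increasing the alignment similarity of significant fragments, this paper introduces a new Cross-Modal Prominent Fragments Enhancement Aligning Network (CPFEAN), which improves retrieval accuracy by reducing the participation of irrelevant regions during alignment and relatively strengthening the alignment of prominent words. Extensive quantitative experiments comparing against state-of-the-art methods on the MS-COCO and Flickr30K datasets show that the proposed method outperforms existing methods by about 5% to 10% in the rSum metric.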
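
For readers unfamiliar with the rSum metric mentioned above, the following is a minimal sketch of how it is commonly computed in the image-text retrieval literature: the sum of Recall@1, Recall@5, and Recall@10 (in percent) over both retrieval directions, giving a maximum of 600. We assume this standard definition applies here; the function names `recall_at_k` and `rsum` and the input layout are illustrative choices, not taken from the paper.

```python
import numpy as np


def recall_at_k(sim, gt, ks=(1, 5, 10)):
    """Recall@K in percent for one retrieval direction.

    sim: (n_queries, n_candidates) similarity matrix (higher = more similar).
    gt:  list of sets; gt[q] holds the correct candidate indices for query q.
    (Hypothetical input layout, chosen for this sketch.)
    """
    # Rank candidates for every query from most to least similar.
    order = np.argsort(-sim, axis=1)
    recalls = {}
    for k in ks:
        # A query counts as a hit if any correct candidate is in its top k.
        hits = [
            len(gt[q] & set(order[q, :k].tolist())) > 0
            for q in range(sim.shape[0])
        ]
        recalls[k] = 100.0 * float(np.mean(hits))
    return recalls


def rsum(sim_i2t, gt_i2t, sim_t2i, gt_t2i):
    """Sum of R@1, R@5, R@10 for image-to-text and text-to-image retrieval."""
    r_i2t = recall_at_k(sim_i2t, gt_i2t)
    r_t2i = recall_at_k(sim_t2i, gt_t2i)
    return sum(r_i2t.values()) + sum(r_t2i.values())


# Toy usage: 3 images and 3 captions, one matching caption per image
# (a simplification; MS-COCO and Flickr30K pair each image with 5 captions).
sim = np.array([[0.9, 0.1, 0.2],
                [0.3, 0.2, 0.8],
                [0.1, 0.7, 0.4]])
gt = [{0}, {1}, {2}]
score = rsum(sim, gt, sim.T, gt)
```

Because rSum aggregates six recall values, a model can improve it either by ranking the correct match first more often (R@1) or by pulling it into the top 5 or top 10.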

Cross-Modal Prominent Fragments Enhancement Aligning Network for Image-Text Retrieval