Image-text retrieval has developed rapidly in recent years. However, it remains challenging in remote sensing because of visual-semantic imbalance, which leads to incorrect matching of non-semantic visual and textual features. To address this problem, we propose a novel Direction-Oriented Visual-semantic Embedding Model (DOVE) to mine the relationship between vision and language. Concretely, a Regional-Oriented Attention Module (ROAM) adaptively adjusts the distance between the final visual and textual embeddings in the latent semantic space, oriented by regional visual features. Meanwhile, a lightweight Digging Text Genome Assistant (DTGA) is designed to expand the range of tractable textual representation and enhance global word-level semantic connections using fewer attention operations. Finally, we exploit a global visual-semantic constraint to reduce single visual dependency and to serve as an external constraint for the final visual and textual representations. The effectiveness and superiority of our method are verified by extensive experiments, including parameter evaluation, quantitative comparison, ablation studies, and visual analysis, on two benchmark datasets, RSICD and RSITMD.
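
As a rough illustration of what retrieval in a shared latent semantic space involves, the sketch below ranks candidate images for a text query by the similarity of their embeddings. It is not the DOVE method: ROAM, DTGA, and the global visual-semantic constraint are not modeled. The use of cosine similarity, the function names `cosine_similarity` and `rank_images`, and the toy embedding values are all assumptions made for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_images(text_embedding, image_embeddings):
    """Return image indices sorted from most to least similar to the text."""
    scores = [cosine_similarity(text_embedding, img) for img in image_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy, hand-made embeddings (hypothetical values, not model outputs).
image_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
text_embedding = [0.9, 0.1, 0.0]

ranking = rank_images(text_embedding, image_embeddings)
print(ranking)
```

In a trained model the embeddings would come from the visual and textual encoders, and the training objective would shape the distances between matched and unmatched pairs; here they are fixed by hand only to show the ranking step.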

Direction-Oriented Visual-semantic Embedding Model for Remote Sensing Image-text Retrieval