The canonical approach to video-text retrieval leverages a coarse-grained or
fine-grained alignment between visual and textual information. However,
retrieving the correct video according to the text query is often challenging
as it requires the ability to reason about both high-level (scene) and
low-level (object) visual clues and how they relate to the text query. To this
end, we propose a Unified Coarse-to-fine Alignment model, dubbed UCoFiA.
Specifically, our model captures the cross-modal similarity information at
different granularity levels. To alleviate the effect of irrelevant visual
clues, we also apply an Interactive Similarity Aggregation module (ISA) to
consider the importance of different visual features while aggregating the
cross-modal similarity to obtain a similarity score for each granularity.
Finally, we apply the Sinkhorn-Knopp algorithm to normalize the similarities of
each level before summing them, alleviating over- and under-representation
issues at different levels. By jointly considering the cross-modal similarity at
different granularities, UCoFiA effectively unifies multi-grained
alignments. Empirically, UCoFiA outperforms previous state-of-the-art
CLIP-based methods on multiple video-text retrieval benchmarks, achieving 2.4%,
1.4% and 1.3% improvements in text-to-video retrieval R@1 on MSR-VTT,
ActivityNet, and DiDeMo, respectively. Our code is publicly available at
this https URL
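
As a rough illustration of the normalization step described above, the sketch below applies Sinkhorn-Knopp iterations to each level's query-by-video similarity matrix and then sums the levels. It is a minimal pure-Python sketch under stated assumptions, not the authors' implementation: the function names `sinkhorn_knopp` and `fuse_levels`, the exponentiation with a `temperature`, the iteration count, and the toy matrices are all hypothetical.

```python
import math


def sinkhorn_knopp(sim, n_iters=50, temperature=1.0):
    """Balance a similarity matrix by alternating column and row scaling.

    sim: list of rows (text queries x videos) of real-valued similarities.
    Returns a matrix whose rows sum to 1 and whose columns sum to
    approximately n_rows / n_cols once the iterations have converged.
    """
    n_rows, n_cols = len(sim), len(sim[0])
    # Assumption: exponentiate so every entry is strictly positive.
    # Subtracting the global max only rescales the whole matrix, which
    # the normalization below cancels out.
    top = max(max(row) for row in sim)
    mat = [[math.exp((s - top) / temperature) for s in row] for row in sim]
    col_target = n_rows / n_cols
    for _ in range(n_iters):
        # Column step: give every video the same total mass.
        col_sums = [sum(mat[i][j] for i in range(n_rows)) for j in range(n_cols)]
        mat = [[mat[i][j] * col_target / col_sums[j] for j in range(n_cols)]
               for i in range(n_rows)]
        # Row step: give every query a total mass of 1.
        mat = [[v / sum(row) for v in row] for row in mat]
    return mat


def fuse_levels(level_sims, n_iters=50):
    """Normalize each granularity level separately, then sum element-wise."""
    normalized = [sinkhorn_knopp(s, n_iters) for s in level_sims]
    n_rows, n_cols = len(level_sims[0]), len(level_sims[0][0])
    return [[sum(m[i][j] for m in normalized) for j in range(n_cols)]
            for i in range(n_rows)]


# Toy example: video 0 scores highly for every query at the coarse level.
coarse = [[0.9, 0.3], [0.8, 0.6]]
fine = [[0.5, 0.1], [0.2, 0.4]]
fused = fuse_levels([coarse, fine])
```

In the toy `coarse` matrix, query 1 initially scores video 0 above video 1; after balancing, the mass that video 0 collects from every query is scaled down, so query 1 ranks video 1 first. This is one way to read the over- and under-representation issue the abstract mentions.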

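By jointly considering cross-modal similarity at different granularities, we propose UCoFiA, a unified multi-grained alignment model that substantially outperforms previous CLIP-based methods, improving text-to-video retrieval R@1 by 2.4%, 1.4%, and 1.3% on multiple video-text retrieval benchmarks.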
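
For readers unfamiliar with the metric, R@1 (Recall at rank 1) is the percentage of text queries whose ground-truth video is ranked first. The sketch below computes Recall@K from a similarity matrix. It is illustrative only: the function name `recall_at_k`, the assumption that query `i` matches video `i`, the optimistic handling of ties, and the toy scores are all hypothetical and not taken from the paper.

```python
def recall_at_k(sim, k=1):
    """Percentage of queries whose correct video is ranked in the top k.

    sim[i][j] is the similarity between text query i and video j.
    Assumption: the ground-truth video for query i has index i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Zero-based rank of the correct video: how many videos score
        # strictly higher (ties are resolved in favor of the correct video).
        rank = sum(1 for s in row if s > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


# Toy scores: queries 0 and 2 rank their own video first, query 1 does not.
toy_sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.2, 0.1, 0.7],
]
r1 = recall_at_k(toy_sim, k=1)
r2 = recall_at_k(toy_sim, k=2)
```

Under this definition, a reported gain of 2.4% in R@1 means 2.4 more queries per hundred retrieve the correct video at the top position.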

Unified Coarse-to-Fine Alignment for Video-Text Retrieval

Most existing methods in vision language pre-training rely on object-centric
features extracted through object detection and make fine-grained alignments
between the extracted features and texts. It is challenging for these methods
to learn relations among multiple objects. To this end, we propose a new method
called X-VLM to perform "multi-grained vision language pre-training." The key
to learning multi-grained alignments is to locate visual concepts in the image
given the associated texts, and in the meantime align the texts with the visual
concepts, where the alignments span multiple granularities. Experimental results
show that X-VLM effectively leverages the learned multi-grained alignments in
many downstream vision language tasks and consistently outperforms
state-of-the-art methods.

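This work proposes X-VLM, a multi-grained vision language pre-training method that locates visual concepts in the image and aligns them with the text to achieve multi-grained alignment; applied to downstream vision language tasks, it achieves strong results and outperforms existing state-of-the-art methods.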