State-of-the-art text-video retrieval (TVR) methods typically utilize CLIP
and cosine similarity for efficient retrieval. Meanwhile, cross-attention
methods, which employ a transformer decoder to compute attention between each
text query and all frames in a video, offer a more comprehensive interaction
between text and videos. However, these methods lack important fine-grained
spatial information as they directly compute attention between text and
video-level tokens. To address this issue, we propose CrossTVR, a two-stage
text-video retrieval architecture. In the first stage, we leverage existing TVR
methods with a cosine similarity network for efficient text/video candidate
selection. In the second stage, we propose a novel decoupled video-text
cross-attention module to capture fine-grained multimodal information in spatial
and temporal dimensions. Additionally, we employ a frozen CLIP model strategy in
fine-grained retrieval, enabling scalability to larger pre-trained vision
models like ViT-G, resulting in improved retrieval performance. Experiments on
text-video retrieval datasets demonstrate the effectiveness and scalability of
our proposed CrossTVR compared to state-of-the-art approaches.

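This work proposes CrossTVR, a two-stage text-video retrieval architecture. The first stage uses existing text-video retrieval methods for candidate selection; the second stage introduces a novel decoupled video-text cross-attention module to capture fine-grained multimodal information across the spatial and temporal dimensions. Adopting a frozen CLIP model strategy in fine-grained retrieval allows scaling to larger pre-trained vision models such as ViT-G, which improves retrieval performance. Experiments on text-video retrieval datasets demonstrate the effectiveness and scalability of CrossTVR compared with state-of-the-art approaches.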
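
The two-stage flow described above (cosine-similarity candidate selection, then fine-grained re-ranking of the shortlisted videos) can be illustrated with a minimal sketch. This is not the CrossTVR implementation: the function names, the value of `k`, and the `fine_score` callable, which stands in for the decoupled video-text cross-attention module, are all assumptions made for illustration, and the embeddings are assumed to be precomputed by some text and video encoder.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def select_candidates(text_emb, video_embs, k):
    """Stage 1 (sketch): rank all videos by cosine similarity, keep the top k.

    text_emb:   (d,) embedding of one text query (assumed precomputed).
    video_embs: (n, d) one embedding per video (assumed precomputed).
    Returns the indices of the k most similar videos and their similarities.
    """
    sims = l2_normalize(video_embs) @ l2_normalize(text_emb)
    k = min(k, sims.shape[0])
    top = np.argsort(-sims, kind="stable")[:k]
    return top, sims[top]


def rerank(text_emb, candidate_ids, fine_score):
    """Stage 2 (sketch): re-score only the shortlisted videos.

    fine_score is a hypothetical callable (text_emb, video_index) -> float.
    In the paper this role is played by the cross-attention module; here it
    is an opaque placeholder supplied by the caller.
    """
    scores = np.array([fine_score(text_emb, int(i)) for i in candidate_ids])
    order = np.argsort(-scores, kind="stable")
    return candidate_ids[order], scores[order]


def two_stage_retrieve(text_emb, video_embs, fine_score, k=3):
    """Run cheap candidate selection, then the expensive scorer on k videos."""
    candidate_ids, _ = select_candidates(text_emb, video_embs, k)
    return rerank(text_emb, candidate_ids, fine_score)
```

The point of the split is cost: the first stage is a single matrix-vector product over the whole collection, while the more expensive scorer runs only on the `k` shortlisted videos per query.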