In text-video retrieval, the objective is to learn a cross-modal similarity function between a text and a video that ranks relevant text-video pairs higher than irrelevant pairs. However, videos inherently express a much wider gamut of information than texts. Instead, texts often capture sub-regions of entire videos and are most semantically similar to certain frames within videos. Therefore, for a given text, a retrieval model should focus on the text's most semantically similar video sub-regions to make a more relevant comparison. Yet, most existing works aggregate entire videos without directly considering text. Common text-agnostic aggregations schemes include mean-pooling or self-attention over the frames, but these are likely to encode misleading visual information not described in the given text. To address this, we propose a cross-modal attention model called X-Pool that reasons between a text and the frames of a video. Our core mechanism is a scaled dot product attention for a text to attend to its most semantically similar frames. We then generate an aggregated video representation conditioned on the text's attention weights over the frames. We evaluate our method on three benchmark datasets of MSR-VTT, MSVD and LSMDC, achieving new state-of-the-art results by up to 12% in relative improvement in Recall@1. Our findings thereby highlight the importance of joint text-video reasoning to extract important visual cues according to text. Full code and demo can be found at: https://layer6ai-labs.github.io/xpool/

提出了一种名为 X-Pool 的跨模态注意力模型，用于在文本和视频之间进行推理，从而提取重要的视觉线索。通过使用一个标度点乘的注意力机制，允许文本关注其最语义相似的帧，并生成基于文本的帧的注意力权重的聚合视频表示。在 MSR-VTT、MSVD 和 LSMDC 三个基准数据集上进行评估，实现了相对提高 Recall@1 高达 12% 的新的最佳效果。

跨媒体语言-视频注意力X-Pool在文本-视频检索中的应用