Video-text pre-training aims to learn transferable representations from
large-scale video-text pairs by aligning the semantics of visual and
textual information. State-of-the-art approaches extract visual features from
raw pixels in an end-to-end fashion. However, these methods operate directly
at the frame level and thus overlook the spatio-temporal structure of objects
in video, even though that structure corresponds closely to the nouns in
textual descriptions. In
this work, we propose RegionLearner, a simple yet effective module for
video-text representation learning that takes the structure of objects into
account during pre-training on large-scale video-text pairs. Given
a video, our module (1) first quantizes visual features into semantic clusters,
then (2) generates learnable masks and uses them to aggregate the features
belonging to the same semantic region, and finally (3) models the interactions
between different aggregated regions. In contrast to using off-the-shelf object
detectors, our proposed module does not require explicit supervision and is
much more computationally efficient. We pre-train the proposed approach on the
public WebVid2M and CC3M datasets. Extensive evaluations on four downstream
video-text retrieval benchmarks clearly demonstrate the effectiveness of our
RegionLearner. The code will be available at
this https URL

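This work proposes a new video-text representation learning module (RegionLearner) that takes object structure into account during pre-training on large-scale video-text pairs, merges visual features through semantic clustering, and finally models the interactions between the aggregated regions, thereby improving video-text retrieval performance.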
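
The three steps named in the abstract (quantization, mask-based aggregation, region interaction) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify how any step is computed, so the nearest-neighbour codebook lookup, the softmax masks produced by a linear projection, and the single-head self-attention used here are all assumptions, as are the function names, the parameter `mask_weights`, and every tensor size.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def quantize(features, codebook):
    """Step 1 (assumed form): map each feature to its nearest codebook entry.

    features: (N, D) patch features from all frames, flattened.
    codebook: (K, D) cluster centres (learned in a real model).
    """
    # Squared Euclidean distance between every feature and every centre.
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)            # (N,) cluster index per feature
    return codebook[idx], idx


def aggregate_regions(quantized, mask_weights):
    """Step 2 (assumed form): soft masks over positions, one per region.

    mask_weights: (D, R) hypothetical learnable projection producing R masks.
    """
    logits = quantized @ mask_weights  # (N, R)
    masks = softmax(logits, axis=0)    # each mask sums to 1 over positions
    regions = masks.T @ quantized      # (R, D) mask-weighted averages
    return regions, masks


def region_interaction(regions):
    """Step 3 (assumed form): single-head self-attention among regions."""
    d = regions.shape[-1]
    attn = softmax(regions @ regions.T / np.sqrt(d), axis=-1)  # (R, R)
    return attn @ regions


def region_learner_sketch(features, codebook, mask_weights):
    quantized, idx = quantize(features, codebook)
    regions, masks = aggregate_regions(quantized, mask_weights)
    return region_interaction(regions), masks, idx


# Illustrative sizes only: 4 frames of 7x7 patches, 16-dim features,
# 32 codebook entries, 8 regions. Random values stand in for learned weights.
rng = np.random.default_rng(0)
features = rng.normal(size=(4 * 7 * 7, 16))
codebook = rng.normal(size=(32, 16))
mask_weights = rng.normal(size=(16, 8))

out, masks, idx = region_learner_sketch(features, codebook, mask_weights)
print(out.shape, masks.shape)
```

In this sketch the masks are normalised over all spatio-temporal positions, so each region embedding is a weighted average of quantized features drawn from every frame. No object detector or box supervision appears anywhere, which is consistent with the abstract's contrast with off-the-shelf detectors.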