Large-scale image-text contrastive pre-training models, such as CLIP, have
been demonstrated to effectively learn high-quality multimodal representations.
However, there is limited research on learning video-text representations for
general video multimodal tasks based on these powerful features. Towards this
goal, we propose a novel video-text pre-training method dubbed VLAB: Video
Language pre-training by feature Adapting and Blending, which transfers CLIP
representations to video pre-training tasks and develops unified video
multimodal models for a wide range of video-text tasks. Specifically, VLAB is
founded on two key strategies: feature adapting and feature blending. In the
former, we introduce a new video adapter module to address CLIP's deficiency in
modeling temporal information and extend the model's capability to encompass
both contrastive and generative tasks. In the latter, we propose an end-to-end
training method that further enhances the model's performance by exploiting the
complementarity of image and video features. We validate the effectiveness and
versatility of VLAB through extensive experiments on highly competitive video
multimodal tasks, including video-text retrieval, video captioning, and video
question answering. Remarkably, VLAB significantly outperforms competing
methods and sets new records in video question answering on the MSRVTT, MSVD,
and TGIF datasets, achieving accuracies of 49.6, 61.0, and 79.0, respectively.
Code and models will be released.
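
The complementarity that feature blending relies on can be illustrated with a toy example. The sketch below is not VLAB's end-to-end training method: the convex combination, the `alpha` weight, the helper names, and the hand-picked vectors are all assumptions made for illustration. It only shows how a blend of an image-branch feature and a video-branch feature can match a text embedding better than either feature alone.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def temporal_mean_pool(frame_features):
    """Average per-frame features into one clip-level vector."""
    num_frames = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[d] for f in frame_features) / num_frames for d in range(dim)]


def blend_features(image_feature, video_feature, alpha=0.5):
    """Convex combination of the normalized image-branch and video-branch features.

    `alpha` is a hypothetical blending weight: 1.0 keeps only the image branch,
    0.0 keeps only the video branch.
    """
    img = l2_normalize(image_feature)
    vid = l2_normalize(video_feature)
    mixed = [alpha * i + (1.0 - alpha) * v for i, v in zip(img, vid)]
    return l2_normalize(mixed)


def cosine_similarity(a, b):
    """Cosine similarity between two vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


# Toy data (assumed): two frame-level features standing in for an image encoder,
# one vector standing in for a temporally adapted video feature, one text embedding.
frame_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
image_feature = temporal_mean_pool(frame_features)
video_feature = [0.0, 0.0, 1.0]
text_feature = [1.0, 1.0, 1.0]

blended = blend_features(image_feature, video_feature, alpha=0.5)
print("image only:", cosine_similarity(image_feature, text_feature))
print("video only:", cosine_similarity(video_feature, text_feature))
print("blended   :", cosine_similarity(blended, text_feature))
```

In this toy setup the two branches each capture a different part of the text embedding, so the blended feature scores highest. With real features, any gain would depend on how complementary the two branches actually are.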

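This paper proposes VLAB, a new video-text pre-training method that extends CLIP's capabilities through feature adapting and blending and builds a unified video multimodal model. Its effectiveness and versatility are validated on highly competitive tasks, including video-text retrieval, video captioning, and video question answering.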

VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending

Video-Text Pre-training (VTP) aims to learn transferable representations for
various downstream tasks from large-scale web videos. To date, almost all
existing VTP methods are limited to retrieval-based downstream tasks, e.g.,
video retrieval, whereas their transfer potential on localization-based tasks,
e.g., temporal grounding, is under-explored. In this paper, we experimentally
analyze and demonstrate the incompatibility of current VTP methods with
localization tasks, and propose a novel Localization-oriented Video-Text
Pre-training framework, dubbed LocVTP. Specifically, we perform fine-grained
contrastive alignment, as a complement to the coarse-grained one, through a
clip-word correspondence discovery scheme. To further enhance the temporal
reasoning ability of the learned features, we propose a context projection head
and a temporal-aware contrastive loss to perceive contextual relationships.
Extensive experiments on four downstream tasks across six datasets demonstrate
that our LocVTP achieves state-of-the-art performance on both retrieval-based
and localization-based tasks. Furthermore, we conduct comprehensive ablation
studies and thorough analyses to explore the optimum model designs and training
strategies.
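
The idea of fine-grained clip-word alignment can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the LocVTP implementation: it assumes each clip is matched to its single most similar word by cosine similarity, treats the remaining words of the same sentence as negatives, and uses a standard InfoNCE loss. The function names and the temperature value are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def discover_correspondence(clip_feats, word_feats):
    """For each clip, return the index of its most similar word (assumed rule)."""
    return [
        max(range(len(word_feats)), key=lambda j: cosine(clip, word_feats[j]))
        for clip in clip_feats
    ]


def info_nce(anchor, positive, negatives, temperature=0.07):
    """InfoNCE loss: negative log-probability of the positive among all candidates."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    max_logit = max(logits)
    log_denominator = max_logit + math.log(
        sum(math.exp(value - max_logit) for value in logits)
    )
    return log_denominator - logits[0]


def fine_grained_loss(clip_feats, word_feats, temperature=0.07):
    """Average clip-to-word InfoNCE loss over the discovered correspondences."""
    matches = discover_correspondence(clip_feats, word_feats)
    losses = []
    for clip, j in zip(clip_feats, matches):
        negatives = [w for k, w in enumerate(word_feats) if k != j]
        losses.append(info_nce(clip, word_feats[j], negatives, temperature))
    return sum(losses) / len(losses)


# Toy features (assumed): two clips and two words in a shared 2-D embedding space.
clips = [[1.0, 0.0], [0.0, 1.0]]
words = [[0.6, 0.4], [0.4, 0.6]]
print("matches:", discover_correspondence(clips, words))
print("loss   :", fine_grained_loss(clips, words))
```

The loss shrinks as each clip's matched word becomes more similar to it than the other words are, which is the behaviour a fine-grained alignment objective is meant to encourage.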

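This paper proposes LocVTP, a localization-oriented video-text pre-training framework that uses fine-grained contrastive alignment and context-aware mechanisms to improve the temporal reasoning ability and transferability of the learned features, achieving state-of-the-art performance on four downstream tasks.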

LocVTP: Video-Text Pre-training for Temporal Localization

Video-Text pre-training aims at learning transferable representations from
large-scale video-text pairs via aligning the semantics between visual and
textual information. State-of-the-art approaches extract visual features from
raw pixels in an end-to-end fashion. However, these methods operate directly at
the frame level and thus overlook the spatio-temporal structure of objects in
video, even though this structure has a strong synergy with nouns in textual
descriptions. In
this work, we propose a simple yet effective module for video-text
representation learning, namely RegionLearner, which can take into account the
structure of objects during pre-training on large-scale video-text pairs. Given
a video, our module (1) first quantizes visual features into semantic clusters,
then (2) generates learnable masks and uses them to aggregate the features
belonging to the same semantic region, and finally (3) models the interactions
between different aggregated regions. In contrast to using off-the-shelf object
detectors, our proposed module does not require explicit supervision and is
much more computationally efficient. We pre-train the proposed approach on the
public WebVid2M and CC3M datasets. Extensive evaluations on four downstream
video-text retrieval benchmarks clearly demonstrate the effectiveness of our
RegionLearner. The code will be made available.
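
The three steps described above (quantize, aggregate, interact) can be sketched as follows. This is a simplified illustration, not the RegionLearner module itself: it assumes a fixed codebook with nearest-neighbour assignment, replaces the learnable masks with hard masks taken from the cluster assignment, and models region interactions with parameter-free self-attention. All function names and toy values are hypothetical.

```python
import math


def quantize(features, codebook):
    """Step 1: assign each feature to its nearest codebook entry."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]


def aggregate_regions(features, assignments, num_clusters):
    """Step 2: average the features that share a cluster (hard masks, assumed).

    Clusters with no assigned feature are skipped.
    """
    regions = []
    for k in range(num_clusters):
        members = [f for f, a in zip(features, assignments) if a == k]
        if members:
            dim = len(members[0])
            regions.append(
                [sum(m[d] for m in members) / len(members) for d in range(dim)]
            )
    return regions


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def region_interaction(regions):
    """Step 3: scaled dot-product self-attention across regions, without learned weights."""
    scale = math.sqrt(len(regions[0]))
    outputs = []
    for query in regions:
        weights = softmax(
            [sum(q * k for q, k in zip(query, key)) / scale for key in regions]
        )
        outputs.append(
            [
                sum(w * value[d] for w, value in zip(weights, regions))
                for d in range(len(query))
            ]
        )
    return outputs


# Toy patch features (assumed) that fall into two obvious semantic groups.
patch_features = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.9]]
codebook = [[1.0, 0.0], [0.0, 1.0]]

assignments = quantize(patch_features, codebook)
regions = aggregate_regions(patch_features, assignments, num_clusters=len(codebook))
print("assignments:", assignments)
print("regions    :", regions)
print("interacted :", region_interaction(regions))
```

Because the grouping comes from clustering the features themselves, no object detector or box annotation is involved, which reflects the efficiency argument made in the abstract. A trainable version would learn the codebook, the masks, and the attention projections jointly.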

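This study proposes RegionLearner, a new module for video-text representation learning that accounts for object structure during pre-training on large-scale video-text pairs. It groups visual features into semantic clusters, aggregates them into regions, and models the interactions among the aggregated regions, which improves video-text retrieval.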