Video recognition has been dominated by the end-to-end learning paradigm --
first initializing a video recognition model with weights of a pretrained image
model and then conducting end-to-end training on videos. This enables the video
network to benefit from the pretrained image model. However, finetuning on
videos requires substantial computation and memory, while the alternative of
directly using pretrained image features without finetuning the image backbone
leads to subpar results. Fortunately, recent advances in
Contrastive Vision-Language Pre-training (CLIP) pave the way for a new route
for visual recognition tasks. Pretrained on large open-vocabulary image-text
pair data, these models learn powerful visual representations with rich
semantics. In this paper, we present Efficient Video Learning (EVL) -- an
efficient framework for directly training high-quality video recognition models
with frozen CLIP features. Specifically, we employ a lightweight Transformer
decoder and learn a query token to dynamically collect frame-level spatial
features from the CLIP image encoder. Furthermore, we adopt a local temporal
module in each decoder layer to discover temporal clues from adjacent frames
and their attention maps. We show that despite being efficient to train with a
frozen backbone, our models learn high-quality video representations on a
variety of video recognition datasets. Code is available at
this https URL

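Summary: This paper proposes the Efficient Video Learning (EVL) framework, which uses a lightweight Transformer decoder and a learned query token to dynamically collect frame-level spatial features from the CLIP image encoder, and further adopts a local temporal module in each decoder layer to discover temporal clues from adjacent frames and their attention maps. Although the pretrained image backbone is kept frozen, the study shows that EVL models learn high-quality video representations on a variety of video recognition datasets.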
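
To make the query-token idea concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes frame-level features have already been extracted by a frozen image encoder and are represented as nested lists (T frames, N patch tokens per frame, D dimensions). The function names `query_pool` and `local_temporal_mix` are hypothetical. The query is random rather than learned, the attention has no learned projections, and the fixed three-tap average over adjacent frames is only a stand-in for the local temporal module, whose exact form is not described here.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def local_temporal_mix(frame_features, kernel=(0.25, 0.5, 0.25)):
    """Assumed stand-in for a local temporal module: blend each patch token
    with the token at the same position in the previous and next frame."""
    T = len(frame_features)
    mixed = []
    for t in range(T):
        frame = []
        for n, patch in enumerate(frame_features[t]):
            out = [0.0] * len(patch)
            for offset, k in zip((-1, 0, 1), kernel):
                s = min(max(t + offset, 0), T - 1)  # clamp at clip boundaries
                neighbor = frame_features[s][n]
                out = [o + k * v for o, v in zip(out, neighbor)]
            frame.append(out)
        mixed.append(frame)
    return mixed


def query_pool(query, frame_features):
    """Single-head cross-attention of one query vector over all patch tokens
    of all frames. Returns the pooled video vector and attention weights."""
    tokens = [patch for frame in frame_features for patch in frame]
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, tok) * scale for tok in tokens])
    pooled = [
        sum(w * tok[d] for w, tok in zip(weights, tokens))
        for d in range(len(query))
    ]
    return pooled, weights


# Toy usage: random numbers stand in for frozen image-encoder features.
random.seed(0)
T, N, D = 4, 3, 8
features = [
    [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
    for _ in range(T)
]
query = [random.gauss(0, 1) for _ in range(D)]  # would be learned in training
video_vec, attn = query_pool(query, local_temporal_mix(features))
```

In a trainable version, only the query, the decoder weights, and the temporal module would receive gradients, while the image features stay fixed. That is where the training-cost savings described in the abstract would come from.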