Image-based visual-language (I-VL) pre-training has shown great success in
learning joint visual-textual representations from large-scale web data,
revealing a remarkable ability for zero-shot generalisation. This paper presents
a simple but strong baseline to efficiently adapt the pre-trained I-VL model,
and exploit its powerful ability for resource-hungry video understanding tasks,
with minimal training. Specifically, we propose to optimise a few random
vectors, termed continuous prompt vectors, that convert video-related tasks
into the same format as the pre-training objectives. In addition, to bridge the
gap between static images and videos, temporal information is encoded with
lightweight Transformers stacked on top of frame-wise visual features.
Experimentally, we conduct extensive ablation studies to analyse the critical
components. On 10 public benchmarks of action recognition, action localisation,
and text-video retrieval, across closed-set, few-shot, and zero-shot scenarios,
we achieve competitive or state-of-the-art performance compared with existing methods,
despite optimising significantly fewer parameters.
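
As a rough illustration of the prompting idea described above, the toy sketch
below wraps class-name token embeddings with prompt vectors and classifies a
video by cosine similarity between its feature and each class's text feature.
Everything here is an assumption made for illustration: the function names, the
2-D embeddings, and in particular the use of mean pooling as a stand-in for
both the pre-trained text encoder and the temporal Transformer. Training of the
prompt vectors is omitted entirely.

```python
import math


def mean_pool(vectors):
    """Average equal-length vectors (hypothetical stand-in for an encoder)."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def build_prompted_input(prefix, class_tokens, suffix):
    """Surround class-name token embeddings with prompt vectors."""
    return prefix + class_tokens + suffix


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def classify(video_feature, text_features):
    """Pick the class whose text feature best matches the video feature."""
    scores = {name: cosine(video_feature, feat)
              for name, feat in text_features.items()}
    return max(scores, key=scores.get), scores


# Hypothetical prompt vectors; in practice these would be the only
# parameters being optimised, here they are simply fixed numbers.
prefix = [[0.1, 0.0]]
suffix = [[0.0, 0.1]]

# Hypothetical token embeddings for two action classes.
class_tokens = {"run": [[1.0, 0.0]], "swim": [[0.0, 1.0]]}

# Text side: prompt vectors + class tokens -> (stand-in) text encoder.
text_features = {
    name: mean_pool(build_prompted_input(prefix, tokens, suffix))
    for name, tokens in class_tokens.items()
}

# Video side: frame-wise features -> (stand-in) temporal aggregation.
frame_features = [[0.9, 0.2], [0.8, 0.1], [1.0, 0.3]]
video_feature = mean_pool(frame_features)

label, scores = classify(video_feature, text_features)
print(label, scores)
```

Because classification is reduced to matching a video feature against text
features, adding a new class in this sketch only requires a new entry in
`class_tokens`, which is one way to picture the zero-shot setting.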

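This work proposes a simple but strong baseline that efficiently adapts a
pre-trained I-VL model and exploits its powerful ability for resource-hungry
video understanding tasks with minimal training. It optimises a few random
vectors, termed continuous prompt vectors, which convert video-related tasks
into the same format as the pre-training objectives. On 10 public benchmarks of
action recognition, action localisation, and text-video retrieval, across
closed-set, few-shot, and zero-shot scenarios, it achieves competitive or
state-of-the-art performance compared with existing methods, despite optimising
significantly fewer parameters. Extensive ablation studies analyse the critical
components, including how the gap between static images and videos is bridged.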