Video-language pre-trained models have shown remarkable success on
video question answering (VideoQA) tasks. However, due to the length of video
sequences, training large-scale video-based models incurs considerably higher
costs than training image-based ones. This motivates us to leverage the
knowledge from image-based pretraining, despite the obvious gaps between image
and video domains. To bridge these gaps, we propose Tem-Adapter,
which enables the learning of temporal dynamics and complex semantics by a
visual Temporal Aligner and a textual Semantic Aligner. Unlike conventional
pretrained knowledge adaptation methods that only concentrate on the downstream
task objective, the Temporal Aligner introduces an extra language-guided
autoregressive task aimed at facilitating the learning of temporal
dependencies, with the objective of predicting future states based on
historical clues and language guidance that describes event progression.
In addition, to reduce the semantic gap and adapt the textual representation for
better event description, we introduce a Semantic Aligner that first uses a
template to fuse question and answer pairs into event descriptions and then
learns a Transformer decoder that refines them with the whole video sequence as
guidance. We evaluate Tem-Adapter and several methods for transferring
pre-trained knowledge on two VideoQA benchmarks, and the significant performance
improvement demonstrates the effectiveness of our method.
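
The language-guided autoregressive objective of the Temporal Aligner can be illustrated with a toy sketch. This is not the authors' implementation: the predictor below (`predict_next`, a fixed blend of the mean of past frame features and a text embedding, with a hypothetical mixing weight `alpha`) stands in for a learned network, and the squared-error loss is an assumption. It only shows the shape of the task: at each step, predict the next frame state from the history and the language guidance.

```python
def predict_next(history, text_emb, alpha=0.5):
    """Toy stand-in for a learned predictor (an assumption, not the paper's model).

    Blends the mean of the past frame features with the text embedding.
    """
    dim = len(text_emb)
    mean_hist = [sum(f[d] for f in history) / len(history) for d in range(dim)]
    return [(1 - alpha) * mean_hist[d] + alpha * text_emb[d] for d in range(dim)]


def autoregressive_loss(frames, text_emb, alpha=0.5):
    """Mean squared error between predicted and actual future frame features.

    At step t the predictor sees only frames[:t] (the history) plus the text.
    """
    if len(frames) < 2:
        raise ValueError("need at least two frames to predict a future state")
    total, count = 0.0, 0
    for t in range(1, len(frames)):
        pred = predict_next(frames[:t], text_emb, alpha)
        total += sum((p - x) ** 2 for p, x in zip(pred, frames[t]))
        count += len(pred)
    return total / count
```

In a real model the blend would be replaced by a trainable module, and this loss would be optimized alongside the downstream VideoQA objective.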

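By introducing Tem-Adapter, which combines a visual Temporal Aligner with a textual Semantic Aligner, this work leverages knowledge from image-based pretraining to bridge the gap between the image and video domains, enabling the learning of temporal dynamics and complex semantics; experiments on two VideoQA benchmarks validate the method's effectiveness.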
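
The first step of the Semantic Aligner, fusing a question with each candidate answer into an event description, might look like the sketch below. The abstract does not give the template, so the wording `"Question: ... Answer: ..."` and the function name `build_event_descriptions` are hypothetical.

```python
def build_event_descriptions(question, candidates):
    """Fuse one question with each candidate answer into a single text string.

    The template wording is a hypothetical placeholder, not the paper's.
    """
    template = "Question: {q} Answer: {a}"
    return [template.format(q=question.strip(), a=c.strip()) for c in candidates]
```

Each resulting string would then be encoded by the text encoder and refined by the decoder that attends to the video sequence.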

Tem-adapter: An Image-Text Pretraining Method for Video Question Answering

Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer

Recent large-scale video-language pre-trained models have shown appealing
performance on various downstream tasks. However, the pre-training process is
computationally expensive due to the requirement of millions of video-text
pairs and the redundant data structure of each video. To mitigate these
problems, we propose LiteVL, which adapts a pre-trained image-language model
BLIP into a video-text model directly on downstream tasks, without heavy
pre-training. To enhance the temporal modeling lacking in the image-language
model, we propose to add temporal attention modules in the image encoder of
BLIP with dynamic temporal scaling. In addition to this model-wise adaptation, we
also propose a non-parametric pooling mechanism to adaptively reweight the
fine-grained video embedding conditioned on the text. Experimental results on
text-video retrieval and video question answering show that the proposed LiteVL
outperforms previous video-language pre-trained models by a clear margin,
even without any video-language pre-training.
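
One way to read "non-parametric pooling conditioned on the text" is a softmax over frame-text similarities, which introduces no learned weights. The sketch below follows that reading. The dot-product similarity, the `temperature` parameter, and the function name `text_conditioned_pool` are assumptions, and LiteVL's actual mechanism may differ.

```python
import math


def text_conditioned_pool(frame_embs, text_emb, temperature=1.0):
    """Pool frame embeddings with weights from their similarity to the text.

    No trainable parameters: weights are a softmax over dot products.
    Returns (pooled_embedding, weights).
    """
    scores = [
        sum(f * t for f, t in zip(frame, text_emb)) / temperature
        for frame in frame_embs
    ]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(text_emb)
    pooled = [
        sum(w * frame[d] for w, frame in zip(weights, frame_embs))
        for d in range(dim)
    ]
    return pooled, weights
```

Frames that align more closely with the text receive larger weights, so the pooled video embedding emphasizes the content relevant to the query.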

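This paper proposes LiteVL, which builds on the BLIP image-language model: it adds temporal attention modules with dynamic temporal scaling to the image encoder and introduces a non-parametric pooling mechanism that adaptively reweights the fine-grained video embedding conditioned on the text. It achieves strong performance even without any video-language pre-training.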