Cross-modal learning of video and text plays a key role in Video Question
Answering (VideoQA). In this paper, we propose a visual-text attention
mechanism that uses Contrastive Language-Image Pre-training (CLIP), trained
on a large number of general-domain language-image pairs, to guide cross-modal
learning for VideoQA. Specifically, we first extract video features with
TimeSformer and text features with BERT from the target application domain,
and use CLIP to extract a pair of visual-text features from the
general-knowledge domain through domain-specific learning. We then propose
Cross-domain Learning to extract the attention information between visual and
linguistic features across the target domain and the general domain. The set of
CLIP-guided visual-text features is integrated to predict the answer. The
proposed method is evaluated on the MSVD-QA and MSRVTT-QA datasets, where it
outperforms state-of-the-art methods.

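This paper proposes a Visual-Text Attention mechanism that uses Contrastive
Language-Image Pre-training (CLIP) to guide cross-modal learning for the video
question answering task. After video and text features are extracted in the
specific domain, CLIP is used to extract a set of visual-text features from the
general-knowledge domain, and cross-domain learning is proposed to extract the
attention information between visual and linguistic features across the target
domain and the general domain. The features are integrated for transfer
learning, and the results show that this method outperforms existing
state-of-the-art methods.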