Vision-language models (VLMs) pre-trained on large-scale image-text pairs
have demonstrated impressive transferability on various visual tasks.
Transferring knowledge from such powerful VLMs is a promising direction for
building effective video recognition models. However, current exploration in
this field is still limited. We believe that the greatest value of pre-trained
VLMs lies in building a bridge between visual and textual domains. In this
paper, we propose a novel framework called BIKE, which utilizes the cross-modal
bridge to explore bidirectional knowledge: i) We introduce the Video Attribute
Association mechanism, which leverages the Video-to-Text knowledge to generate
textual auxiliary attributes for complementing video recognition. ii) We also
present a Temporal Concept Spotting mechanism that uses the Text-to-Video
expertise to capture temporal saliency in a parameter-free manner, leading to
enhanced video representation. Extensive studies on six popular video datasets,
including Kinetics-400 & 600, UCF-101, HMDB-51, ActivityNet and Charades, show
that our method achieves state-of-the-art performance in various recognition
scenarios, such as general, zero-shot, and few-shot video recognition. Our best
model achieves a state-of-the-art accuracy of 88.6% on the challenging
Kinetics-400 using the released CLIP model. The code is available at
this https URL.
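
To make the parameter-free temporal saliency idea more concrete, the following is a minimal sketch and not the released implementation. It assumes that per-frame embeddings and a text embedding already live in a shared space (as a CLIP-style model would provide), scores each frame by cosine similarity to the text, and turns the scores into pooling weights with a softmax. The function names, the temperature value, and the toy vectors are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def temporal_saliency(frame_embs, text_emb, temperature=0.1):
    """Softmax over frame-to-text similarities; no learnable parameters.

    The temperature is an assumed value chosen for illustration.
    """
    sims = [cosine(f, text_emb) / temperature for f in frame_embs]
    peak = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]

def pool_frames(frame_embs, weights):
    """Saliency-weighted sum of frame embeddings."""
    dim = len(frame_embs[0])
    return [sum(w * f[d] for w, f in zip(weights, frame_embs)) for d in range(dim)]

# Toy data: three frames, the last one best aligned with the text.
frames = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
text = [0.0, 1.0]

weights = temporal_saliency(frames, text)
video_emb = pool_frames(frames, weights)
```

In this sketch, frames that match the text more closely receive larger weights, so the pooled video embedding emphasizes them without introducing any trainable parameters.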

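This paper introduces BIKE, a novel framework that exploits the cross-modal bridge between video and text. It uses video to automatically generate supplementary textual auxiliary attributes, and uses text to locate temporally salient positions, thereby enhancing the video representation and effectively improving video recognition performance across recognition scenarios. Extensive studies on six popular video datasets show that our method achieves state-of-the-art performance in the various recognition scenarios.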
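
One way to picture the video-to-text direction is as a nearest-neighbor lookup: given a video embedding and a small lexicon of candidate phrases embedded in the same space, keep the closest phrases as auxiliary attributes and join them into a sentence. The sketch below is an illustration under those assumptions, not the authors' code; the lexicon, the value of `k`, the sentence template, and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_attributes(video_emb, lexicon, k=2):
    """Return the k lexicon phrases most similar to the video embedding."""
    ranked = sorted(
        lexicon.items(),
        key=lambda item: cosine(video_emb, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]

def attribute_sentence(attributes):
    """Join attributes into a sentence (hypothetical template)."""
    return "This is a video about " + ", ".join(attributes) + "."

# Toy lexicon of phrase embeddings (assumed values).
lexicon = {
    "guitar": [0.9, 0.1, 0.0],
    "stage": [0.7, 0.7, 0.0],
    "ocean": [0.0, 0.1, 0.9],
}
video_emb = [1.0, 0.3, 0.0]

attrs = top_k_attributes(video_emb, lexicon)
sentence = attribute_sentence(attrs)
```

The resulting sentence could then be encoded by the text encoder and used alongside the video features, which is the complementary role the auxiliary attributes play in the description above.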