Pretrained vision-language models have shown effectiveness in video
understanding. However, recent studies have not sufficiently leveraged
essential temporal information from videos, simply averaging frame-wise
representations or referencing consecutive frames. We introduce Temporally
Contextualized CLIP (TC-CLIP), a pioneering framework for video understanding
that effectively and efficiently leverages comprehensive video information. We
propose Temporal Contextualization (TC), a novel layer-wise temporal
information infusion mechanism for video that extracts core information from
each frame, interconnects relevant information across the video to summarize
into context tokens, and ultimately leverages the context tokens during the
feature encoding process. Furthermore, our Video-conditional Prompting (VP)
module uses the context tokens to generate informative prompts in the text
modality. We conduct extensive experiments in zero-shot, few-shot,
base-to-novel, and fully-supervised action recognition to validate the
superiority of our TC-CLIP. Ablation studies for TC and VP support our design
choices. Code is available at this https URL
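
The three TC steps described above (extract core information per frame, summarize it across the video into context tokens, and use those tokens during encoding) can be illustrated with a minimal NumPy sketch. This is not the released TC-CLIP code: the function names, the norm-based token scoring, the chunk-averaging summarization, and the projection-free single-head attention are all hypothetical stand-ins chosen only to make the data flow concrete.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def select_seed_tokens(frame_tokens, k):
    # frame_tokens: (T, N, D) = frames x patch tokens x channels.
    # ASSUMPTION: the L2 norm stands in for "informativeness";
    # the abstract does not specify the selection criterion.
    scores = np.linalg.norm(frame_tokens, axis=-1)        # (T, N)
    idx = np.argsort(-scores, axis=1)[:, :k]              # (T, k)
    return np.take_along_axis(frame_tokens, idx[:, :, None], axis=1)

def summarize_context(seed_tokens, num_context):
    # Pool seed tokens from all frames and average them in chunks.
    # ASSUMPTION: chunk averaging is a placeholder for the paper's
    # summarization step. Requires num_context <= T * k.
    T, k, D = seed_tokens.shape
    flat = seed_tokens.reshape(T * k, D)
    groups = np.array_split(flat, num_context, axis=0)
    return np.stack([g.mean(axis=0) for g in groups])     # (C, D)

def attend_with_context(frame_tokens, context_tokens):
    # Each frame's tokens attend to their own tokens plus the shared
    # video-level context tokens (appended as extra keys/values).
    # ASSUMPTION: single head, no learned projections.
    T, N, D = frame_tokens.shape
    ctx = np.broadcast_to(context_tokens, (T,) + context_tokens.shape)
    kv = np.concatenate([frame_tokens, ctx], axis=1)      # (T, N + C, D)
    attn = softmax(frame_tokens @ kv.transpose(0, 2, 1) / np.sqrt(D))
    return attn @ kv                                      # (T, N, D)
```

In this sketch, a video of 8 frames with 49 tokens of width 64 would pass through `select_seed_tokens`, `summarize_context`, and `attend_with_context` in that order, and the output keeps the input shape `(8, 49, 64)`. The point of appending context tokens to the keys and values is that every frame can read video-level information without attending to all tokens of all frames.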

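TC-CLIP is an improved vision-language model that boosts video understanding and action recognition by incorporating temporal context information and generating context tokens.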