Open-vocabulary video action recognition is a promising direction that aims
to recognize previously unseen actions from an arbitrary set of
categories. Existing methods typically adapt pretrained image-text models to
the video domain, capitalizing on their inherent strengths in generalization. A
common thread among such methods is the augmentation of visual embeddings with
temporal information to improve the recognition of seen actions. However, they
settle for standard, less informative action descriptions and therefore falter
when confronted with novel actions. Drawing inspiration from human cognitive
processes, we argue that augmenting text embeddings with human prior knowledge
is pivotal for open-vocabulary video action recognition. To this end, we
combine video models with Large Language Models (LLMs) to devise
Action-conditioned Prompts. Specifically, we harness the knowledge in LLMs to
produce a set of descriptive sentences that contain distinctive features for
identifying given actions. Building upon this foundation, we further introduce
a multi-modal action knowledge alignment mechanism that aligns concepts in the
video with the textual knowledge encapsulated in the prompts. Extensive
experiments on various video benchmarks, under zero-shot, few-shot, and
base-to-novel generalization settings, demonstrate that our method not only
achieves new state-of-the-art (SOTA) performance but also offers excellent
interpretability.
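
The idea of scoring a video against several descriptive sentences per action can be illustrated with a minimal sketch. This is not the authors' implementation: the embeddings below are made-up toy vectors standing in for the outputs of a video encoder and a text encoder, the descriptors are assumed to have been produced by an LLM beforehand, and averaging cosine similarities is only one plausible way to combine them. The function names `cosine` and `classify` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(video_emb, prompt_embs):
    """Score a video against each action's descriptor embeddings.

    prompt_embs maps an action name to a list of embeddings, one per
    descriptive sentence (assumed to come from an LLM plus a text encoder).
    Returns the best action and the per-action mean similarity.
    """
    scores = {}
    for action, embs in prompt_embs.items():
        sims = [cosine(video_emb, e) for e in embs]
        # Assumption: combine descriptor similarities by simple averaging.
        scores[action] = sum(sims) / len(sims)
    best = max(scores, key=scores.get)
    return best, scores


# Toy vectors for illustration only; real embeddings would be much larger.
video_emb = [0.9, 0.1, 0.2]
prompt_embs = {
    "archery": [[1.0, 0.0, 0.1], [0.8, 0.2, 0.3]],
    "juggling": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.2]],
}

best, scores = classify(video_emb, prompt_embs)
print(best, scores)
```

Because each descriptor contributes its own similarity, the individual values in `sims` could also be inspected to see which sentences drove a prediction, which is one way to read the interpretability claim.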

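By combining video models with large language models, this work uses action-conditioned prompts to enrich text embeddings with human prior knowledge, achieving new state-of-the-art performance in open-vocabulary video action recognition along with strong interpretability.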