Transferring vision-language knowledge from pretrained multimodal foundation
models to various downstream tasks is a promising direction. However, most
current few-shot action recognition methods are still limited to a single
visual modality input due to the high cost of annotating additional textual
descriptions. In this paper, we develop an effective plug-and-play framework
called CapFSAR to exploit the knowledge of multimodal models without manually
annotating text. To be specific, we first utilize a captioning foundation model
(i.e., BLIP) to extract visual features and automatically generate associated
captions for input videos. Then, we apply a text encoder to the synthetic
captions to obtain representative text embeddings. Finally, a visual-text
aggregation module based on Transformer is further designed to incorporate
cross-modal spatio-temporal complementary information for reliable few-shot
matching. In this way, CapFSAR can benefit from powerful multimodal knowledge
of pretrained foundation models, yielding more comprehensive classification in
the low-shot regime. Extensive experiments on multiple standard few-shot
benchmarks demonstrate that the proposed CapFSAR performs favorably against
existing methods and achieves state-of-the-art performance. The code will be
made publicly available.

通过 CapFSAR 框架，我们利用预训练的多模态基础模型的知识，从合成描述中提取视觉特征和相关文本嵌入，并设计了基于 Transformer 的视觉文本聚合模块，以在低样本情况下实现更全面的分类。在多个标准的少样本基准实验中，我们的 CapFSAR 方法表现优于现有方法，并达到了最先进的性能。