Recently, large-scale vision-language pretrained models such as CLIP,
coupled with Parameter-Efficient Fine-Tuning (PEFT), have attracted
substantial attention in video action recognition. Nevertheless,
prevailing approaches tend to prioritize strong supervised performance at the
expense of the models' generalization capability during
transfer. In this paper, we introduce a novel Multimodal, Multi-task CLIP
adapting framework named \name to address these challenges, preserving both
high supervised performance and robust transferability. Firstly, to enhance the
individual modality architectures, we introduce adapters into both the
visual and text branches. Specifically, we design a novel visual TED-Adapter
that performs global Temporal Enhancement and local temporal Difference
modeling to improve the temporal representation capabilities of the visual
encoder. Moreover, we adopt text encoder adapters to strengthen the learning of
semantic label information. Secondly, we design a multi-task decoder with a
rich set of supervisory signals to meet the need for both strong
supervised performance and generalization within a multimodal framework.
Experimental results validate the efficacy of our approach, demonstrating
exceptional performance in supervised learning while maintaining strong
generalization in zero-shot scenarios.

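This work introduces \name, a novel multimodal, multi-task CLIP adaptation framework that uses multimodal adapters and a multi-task decoder to achieve strong supervised performance together with strong generalization in zero-shot settings.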
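
To make the idea of combining global temporal enhancement with local temporal difference modeling more concrete, the following is a minimal NumPy sketch. It is not the TED-Adapter itself: the abstract does not specify its layers, so the bottleneck layout, the mean-pooled global context, the adjacent-frame difference, and all names (`temporal_difference`, `global_temporal_context`, `ted_adapter_sketch`, `w_down`, `w_up`) are illustrative assumptions.

```python
import numpy as np


def temporal_difference(x):
    """Local cue (assumed form): change between adjacent frames. x is (T, D)."""
    diff = np.zeros_like(x)
    diff[1:] = x[1:] - x[:-1]  # the first frame has no predecessor, so it stays zero
    return diff


def global_temporal_context(x):
    """Global cue (assumed form): mean over all frames, shared by every frame."""
    return np.broadcast_to(x.mean(axis=0, keepdims=True), x.shape)


def ted_adapter_sketch(x, w_down, w_up):
    """Hypothetical bottleneck adapter with a residual connection.

    x:      (T, D) per-frame features from a frozen encoder layer
    w_down: (D, r) down-projection, with r much smaller than D
    w_up:   (r, D) up-projection
    """
    h = x @ w_down                                        # project to the bottleneck
    h = h + global_temporal_context(h) + temporal_difference(h)
    h = np.maximum(h, 0.0)                                # ReLU non-linearity
    return x + h @ w_up                                   # residual keeps the frozen features


# Usage: 8 frames, 16-dimensional features, bottleneck width 4.
rng = np.random.default_rng(0)
frames = rng.standard_normal((8, 16))
w_down = rng.standard_normal((16, 4)) * 0.1
w_up = np.zeros((4, 16))  # zero-initialised, so the adapter starts as the identity
out = ted_adapter_sketch(frames, w_down, w_up)
print(out.shape)  # (8, 16)
```

Zero-initialising the up-projection is a common adapter convention rather than something stated in the abstract. It means the adapted model initially reproduces the frozen encoder's features, which is one plausible way to protect the pretrained model's transferability at the start of fine-tuning.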