Applying large-scale pre-trained visual models such as CLIP to few-shot action
recognition tasks can improve both performance and efficiency. The
"pre-training, fine-tuning" paradigm makes it possible to avoid training a
network from scratch, which can be time-consuming and resource-