The recent advances in convolutional neural networks (CNNs) and Vision
Transformers have convincingly demonstrated high learning capability for video
action recognition on large datasets. Nevertheless, deep models often suffer
from the overfitting effect on small-scale datasets with a