This paper demonstrates a self-supervised approach for learning semantic video representations. Recent vision studies show that masking strategies and natural language supervision both contribute to transferable visual pretraining. Our goal is to learn a more semantic video representation by leveraging text related to the video content during pretraining, in a fully self-supervised manner. To this end, we present FILS, a novel self-supervised method for video Feature prediction In semantic Language Space. The vision model captures valuable structured information by correctly predicting masked feature semantics in language space. It is trained with a patch-wise video-text contrastive strategy in which text representations act as prototypes for transforming vision features into a language space; the transformed features then serve as targets for semantically meaningful feature prediction with our masked encoder-decoder structure. FILS demonstrates remarkable transferability on downstream action recognition tasks, achieving state-of-the-art results on challenging egocentric datasets such as Epic-Kitchens, Something-SomethingV2, Charades-Ego, and EGTEA, using ViT-Base. Our method is also efficient, requiring less computation and smaller batches than previous works.
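
To make the idea of predicting masked features in language space concrete, the following is a minimal sketch and not the FILS implementation. It assumes that a patch feature is mapped into language space by a temperature-scaled softmax over its cosine similarities to the text prototypes, and that the prediction loss is a cross-entropy between target and predicted distributions at masked positions. The abstract specifies neither choice, and all function names, the temperature value, and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def to_language_space(patch_feature, text_prototypes, temperature=0.1):
    """Assumed mapping: express a patch feature as a distribution over text prototypes."""
    sims = [cosine(patch_feature, p) / temperature for p in text_prototypes]
    return softmax(sims)


def masked_prediction_loss(predicted, targets, masked_indices, text_prototypes):
    """Assumed loss: mean cross-entropy between target and predicted
    language-space distributions, computed only at masked patch positions."""
    loss = 0.0
    for i in masked_indices:
        q = to_language_space(targets[i], text_prototypes)    # target distribution
        p = to_language_space(predicted[i], text_prototypes)  # predicted distribution
        loss -= sum(qk * math.log(pk) for qk, pk in zip(q, p))
    return loss / len(masked_indices)


# Toy example: two text prototypes and two patches, both masked.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
targets = [[1.0, 0.1], [0.2, 1.0]]
good_prediction = [[1.0, 0.1], [0.2, 1.0]]  # matches the targets
bad_prediction = [[0.1, 1.0], [1.0, 0.2]]   # points at the wrong prototypes

loss_good = masked_prediction_loss(good_prediction, targets, [0, 1], prototypes)
loss_bad = masked_prediction_loss(bad_prediction, targets, [0, 1], prototypes)
```

In this toy setup the loss is lower when the predicted features align with the same text prototypes as the targets. In a real system the targets and predictions would come from learned encoders and a decoder rather than hand-written vectors.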

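In a fully self-supervised manner, FILS learns a more semantic video representation by predicting the semantics of masked video features in a semantic language space. The method transfers well to downstream action recognition tasks and, using ViT-Base, achieves state-of-the-art performance on challenging egocentric datasets such as Epic-Kitchens, Something-SomethingV2, Charades-Ego, and EGTEA.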

FILS: Self-Supervised Video Feature Prediction in Semantic Language Space