This paper explores feature prediction as a stand-alone objective for unsupervised learning from video and introduces V-JEPA, a collection of vision models trained solely using a feature prediction objective, without the use of pretrained image encoders, text, negative examples, reconstruction, or other sources of supervision. The models are trained on 2 million videos collected from public datasets and are evaluated on downstream image and video tasks. Our results show that learning by predicting video features leads to versatile visual representations that perform well on both motion- and appearance-based tasks without adaptation of the model's parameters, e.g., using a frozen backbone. Our largest model, a ViT-H/16 trained only on videos, obtains 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet1K.

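This study explores feature prediction as a stand-alone objective for unsupervised learning and introduces V-JEPA, a collection of vision models trained solely with a feature prediction objective, without pretrained image encoders, text, negative examples, reconstruction, or other sources of supervision. Our results show that learning by predicting video features yields versatile visual representations that perform well on both motion- and appearance-based tasks without adapting the model's parameters, e.g., with a frozen backbone. Our largest model, a ViT-H/16 trained only on videos, obtains 81.9% accuracy on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet1K.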

Revisiting Feature Prediction for Learning Visual Representations from Video