This paper explores feature prediction as a stand-alone objective for
unsupervised learning from video and introduces V-JEPA, a collection of vision
models trained solely using a feature prediction objective, without the use of
pretrained image encoders, text, negative examples, reconstruction, or other
sources of supervision. The models are trained on 2 million videos collected
from public datasets and are evaluated on downstream image and video tasks. Our
results show that learning by predicting video features leads to versatile
visual representations that perform well on both motion- and appearance-based
tasks, without adaptation of the model's parameters; e.g., using a frozen
backbone. Our largest model, a ViT-H/16 trained only on videos, obtains 81.9%
on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet1K.
