Scaling has not yet been convincingly demonstrated for pure self-supervised learning from video. However, prior work has focused its evaluations on semantic tasks such as action classification and ImageNet classification. In this paper we focus on evaluating self-supervised learning on non-semantic vision tasks that are more spatial (3D) and temporal (+1D = 4D), such as camera pose estimation, point and object tracking, and depth estimation. We show that by learning from very large video datasets, masked auto-encoding (MAE) with transformer video models actually scales, consistently improving performance on these 4D tasks as model size increases from 20M parameters all the way to 22B parameters, by far the largest self-supervised video model reported. Rigorous apples-to-apples comparison with many recent image and video models demonstrates the benefits of scaling 4D representations.

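This work addresses the question of whether pure self-supervised learning from video scales, focusing its evaluation on non-semantic vision tasks such as camera pose estimation, point and object tracking, and depth estimation. By learning from very large video datasets, the paper shows that masked auto-encoding (MAE) with transformer video models scales effectively, significantly improving performance on 4D tasks as model size increases.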

Scaling 4D Representations