We empirically study autoregressive pre-training from videos. To perform our study, we construct a series of autoregressive video models, called Toto. We treat videos as sequences of visual tokens and train transformer models to autoregressively predict future tokens. Our models are pre-trained on a diverse dataset of videos and images comprising over 1 trillion visual tokens. We explore different architectural, training, and inference design choices. We evaluate the learned visual representations on a range of downstream tasks including image recognition, video classification, object tracking, and robotics. Our results demonstrate that, despite minimal inductive biases, autoregressive pre-training leads to competitive performance across all benchmarks. Finally, we find that scaling our video models results in scaling curves similar to those seen in language models, albeit at a different rate. More details at https://brjathu.github.io/toto/
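As a rough illustration of the next-token objective described above, the sketch below computes the average negative log-likelihood of a token sequence under causal prediction. It is a minimal sketch, not the Toto implementation: the function name `next_token_loss`, the discrete token ids, the three-entry vocabulary, and the hand-written logits are all assumptions made for the example, and a real model would produce the logits with a causal transformer.

```python
import math


def next_token_loss(logits, tokens):
    """Average negative log-likelihood of each token given its predecessors.

    logits[i] holds unnormalized scores over the vocabulary for position
    i + 1, assumed to be computed from tokens[: i + 1] only (causal).
    """
    assert len(logits) == len(tokens) - 1
    total = 0.0
    for scores, target in zip(logits, tokens[1:]):
        # Numerically stable log-sum-exp over the vocabulary.
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        # Negative log-probability of the true next token.
        total += log_norm - scores[target]
    return total / len(logits)


# Hypothetical toy "video": 4 visual tokens drawn from a vocabulary of 3.
tokens = [0, 2, 1, 1]

# A model with no knowledge scores every token equally.
uniform_logits = [[0.0, 0.0, 0.0] for _ in range(len(tokens) - 1)]
uniform_loss = next_token_loss(uniform_logits, tokens)

# A model that strongly favors the correct next token at each step.
confident_logits = [[0.0, 0.0, 0.0] for _ in range(len(tokens) - 1)]
for step, target in enumerate(tokens[1:]):
    confident_logits[step][target] = 5.0
confident_loss = next_token_loss(confident_logits, tokens)
```

With uniform scores the loss equals the log of the vocabulary size, which serves as a chance-level baseline; a model that favors the correct next token drives the loss below that value.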

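This study examines how effective autoregressive pre-training from videos is. It proposes a family of models called Toto that treats videos as sequences of visual tokens for training. The results show that, despite having few inductive biases, the pre-trained autoregressive models perform well on multiple downstream tasks and exhibit scaling curves similar to those of language models.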

An Empirical Study of Autoregressive Pre-training from Videos