We propose a strong baseline model for unsupervised feature learning using video data. By learning to predict missing frames or extrapolate future frames from an input video sequence, the model discovers both spatial and temporal correlations which are useful to represent complex deformations and motion patterns. The models we propose are largely borrowed from the language modeling literature, and adapted to the vision domain by quantizing the space of image patches into a large dictionary. We demonstrate the approach on both a filling and a generation task. For the first time, we show that, after training on natural videos, such a model can predict non-trivial motions over short video sequences.
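
The quantization step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the patch dictionary is built, so the use of plain k-means here is an assumption, as are the patch size, the dictionary size, and every function name (`extract_patches`, `build_dictionary`, `assign`). The random frames are stand-in data, not natural video. The point is only to show how frames could become sequences of discrete "visual words" that a language-model-style predictor might consume.

```python
import numpy as np

def extract_patches(frame, size):
    """Split a 2-D grayscale frame into non-overlapping size x size patches, flattened."""
    h, w = frame.shape
    h, w = h - h % size, w - w % size  # crop so the frame tiles evenly
    tiles = frame[:h, :w].reshape(h // size, size, w // size, size)
    return tiles.transpose(0, 2, 1, 3).reshape(-1, size * size)

def assign(patches, centroids):
    """Index of the nearest centroid (squared Euclidean distance) for each patch."""
    d = ((patches[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def build_dictionary(patches, k, iters=10, seed=0):
    """Plain Lloyd's k-means (an assumed choice); returns a (k, dim) array of centroids."""
    rng = np.random.default_rng(seed)
    centroids = patches[rng.choice(len(patches), size=k, replace=False)].astype(float)
    for _ in range(iters):
        labels = assign(patches, centroids)
        for j in range(k):
            members = patches[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids

# Toy "video": 4 random 16x16 frames (stand-in data, not natural video).
rng = np.random.default_rng(0)
video = rng.random((4, 16, 16))
patches = np.concatenate([extract_patches(f, 4) for f in video])
dictionary = build_dictionary(patches, k=8)
# One row of dictionary indices per frame: the discrete tokens a sequence model would see.
tokens = assign(patches, dictionary).reshape(len(video), -1)
```

Each row of `tokens` plays the role of a sentence of word indices, so predicting a missing or future frame reduces to predicting the indices at the corresponding positions.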

Video (Language) Modeling: A Baseline for Generative Models of Natural Videos