In recent years, the task of video prediction (forecasting future video given
past video frames) has attracted attention in the research community. In this
paper we propose a novel approach to this problem with Vector Quantized
Variational AutoEncoders (VQ-VAE). With VQ-VAE we compress high-resolution
videos into a hierarchical set of multi-scale discrete latent variables.
Compared to pixels, this compressed latent space has dramatically reduced
dimensionality, allowing us to apply scalable autoregressive generative models
to predict video. In contrast to previous work that has largely emphasized
highly constrained datasets, we focus on very diverse, large-scale datasets
such as Kinetics-600. To our knowledge, we predict unconstrained video at a
higher resolution, 256x256, than any previous method. We further
validate our approach against prior work via a crowdsourced human evaluation.
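
The compression step described above rests on vector quantization: each
continuous encoder output is replaced by the index of its nearest codebook
entry, so a video becomes a grid of discrete codes. The sketch below shows only
that lookup, in plain Python. The names `quantize` and `dequantize`, the toy
codebook, and the 2-dimensional vectors are illustrative assumptions, not the
paper's implementation, which uses learned codebooks and a hierarchy of latent
scales.

```python
def quantize(vectors, codebook):
    """Return, for each vector, the index of its nearest codebook entry."""
    indices = []
    for v in vectors:
        # Squared Euclidean distance from v to every codebook entry.
        dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in codebook]
        indices.append(dists.index(min(dists)))
    return indices


def dequantize(indices, codebook):
    """Map discrete codes back to their codebook vectors."""
    return [codebook[i] for i in indices]


# Hypothetical toy data: three codebook entries, three encoder outputs.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
encoder_outputs = [[0.9, 1.2], [-0.8, 0.7], [0.1, -0.2]]

codes = quantize(encoder_outputs, codebook)
reconstruction = dequantize(codes, codebook)
print(codes)  # [1, 2, 0]
```

An autoregressive model would then be trained on sequences of such integer
codes rather than on pixels, which is where the reduction in dimensionality
comes from.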

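This paper proposes a VQ-VAE-based approach to video prediction: high-resolution
videos are compressed into a hierarchical set of multi-scale discrete latent
variables, and a scalable autoregressive generative model is then applied to
them. Compared with previous work, it focuses on large-scale, diverse datasets,
and it validates the approach with a human evaluation.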

Predicting Video with VQVAE

We present a self-supervised Contrastive Video Representation Learning (CVRL)
method to learn spatiotemporal visual representations from unlabeled videos.
Our representations are learned using a contrastive loss, where two augmented
clips from the same short video are pulled together in the embedding space,
while clips from different videos are pushed away. We study what makes for good
data augmentations for video self-supervised learning and find that both
spatial and temporal information are crucial. We carefully design data
augmentations involving spatial and temporal cues. Concretely, we propose a
temporally consistent spatial augmentation method to impose strong spatial
augmentations on each frame of the video while maintaining the temporal
consistency across frames. We also propose a sampling-based temporal
augmentation method to avoid overly enforcing invariance on clips that are
distant in time. On Kinetics-600, a linear classifier trained on the
representations learned by CVRL achieves 70.4% top-1 accuracy with a
3D-ResNet-50 (R3D-50) backbone, outperforming ImageNet supervised pre-training
by 15.7% and SimCLR unsupervised pre-training by 18.8% using the same inflated
R3D-50. The performance of CVRL can be further improved to 72.9% with a larger
R3D-152 (2x filters) backbone, significantly closing the gap between
unsupervised and supervised video representation learning. Our code and models
will be made available.
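
Two ideas from the abstract above can be sketched in plain Python: a
temporally consistent spatial augmentation (one random crop offset shared by
every frame of a clip) and a contrastive loss that pulls two clips from the
same video together while pushing clips from other videos away. This is a
minimal sketch under stated assumptions, not the CVRL implementation: the
function names are hypothetical, frames are nested lists of numbers, the crop
stands in for the paper's stronger augmentations, and the loss is a simplified
InfoNCE that draws negatives only from the other view.

```python
import math
import random


def temporally_consistent_crop(clip, crop, rng):
    """Sample ONE crop offset and apply it to every frame of the clip."""
    height, width = len(clip[0]), len(clip[0][0])
    top = rng.randint(0, height - crop)   # randint bounds are inclusive
    left = rng.randint(0, width - crop)
    return [[row[left:left + crop] for row in frame[top:top + crop]]
            for frame in clip]


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: anchors[i] should match positives[i] and no other."""
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [cosine(a, p) / temperature for p in positives]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


# Hypothetical toy clip: 3 frames of 4x4 "pixels" encoding (frame, row, col).
rng = random.Random(0)
clip = [[[t * 100 + r * 10 + c for c in range(4)] for r in range(4)]
        for t in range(3)]
cropped = temporally_consistent_crop(clip, 2, rng)

# Embeddings of two views of two videos: matched pairs vs. swapped pairs.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]])
```

Because the crop offset is drawn once per clip, the same spatial window is
kept in every frame, which preserves motion cues across time. The loss is low
when each clip embedding is closest to its own pair (`aligned`) and high when
the pairs are mismatched (`swapped`).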

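This paper introduces a self-supervised contrastive video representation
learning method based on a contrastive loss: two augmented clips from the same
short video are pulled together in the embedding space, while clips from
different videos are pushed apart. The method depends on good data
augmentations that use both spatial and temporal information. On Kinetics-600,
it reaches 70.4% top-1 accuracy, outperforming ImageNet supervised pre-training
and SimCLR unsupervised pre-training.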