Despite tremendous progress in generating high-quality images using diffusion models, synthesizing a sequence of animated frames that are both photorealistic and temporally coherent is still in its infancy. While off-the-shelf billion-scale datasets for image generation are available, collecting similar video data of the same scale is still challenging. Also, training a video diffusion model is computationally much more expensive than its image counterpart. In this work, we explore finetuning a pretrained image diffusion model with video data as a practical solution for the video synthesis task. We find that naively extending the image noise prior to video noise prior in video diffusion leads to sub-optimal performance. Our carefully designed video noise prior leads to substantially better performance. Extensive experimental validation shows that our model, Preserve Your Own Correlation (PYoCo), attains SOTA zero-shot text-to-video results on the UCF-101 and MSR-VTT benchmarks. It also achieves SOTA video generation quality on the small-scale UCF-101 benchmark with a $10\times$ smaller model using significantly less computation than the prior art.

本文提出了一种新的视频综合方法，它使用预训练模型，并使用经过精心设计的视频噪声先验来生成高质量，时域一致的序列帧，获得了在 UCF-101 和 MSR-VTT 基准测试上 SOTA 的无需训练文本到视频结果。同时，在较小的 UCF-101 基准测试中使用更少的计算资源， $10	imes$更小的模型，达到了SOTA的视频生成质量。

保留自身关联性：一种视频扩散模型的噪声先验