We propose a novel inference technique based on a pretrained diffusion model for text-conditional video generation. Our approach, called FIFO-Diffusion, is conceptually capable of generating infinitely long videos without additional training. It does so by iteratively performing diagonal denoising, which simultaneously processes a series of consecutive frames with increasing noise levels in a queue: at each step, the method dequeues a fully denoised frame at the head and enqueues a new random-noise frame at the tail. However, diagonal denoising is a double-edged sword: frames near the tail can take advantage of cleaner frames through forward reference, but the strategy also induces a discrepancy between training and inference. Hence, we introduce latent partitioning to reduce the training-inference gap and lookahead denoising to leverage the benefit of forward referencing. We demonstrate promising results and the effectiveness of the proposed methods on existing text-to-video generation baselines.
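The queue mechanics of diagonal denoising can be illustrated with a minimal toy sketch. It is not the authors' implementation: frames are single floats rather than video latents, `toy_denoise` is a hypothetical stand-in for one sampler step of a pretrained model, the warm-start of the queue is an assumption, and latent partitioning and lookahead denoising are omitted.

```python
import random
from collections import deque


def toy_denoise(value, level):
    # Hypothetical stand-in for one step of a pretrained model's sampler:
    # takes a frame at noise level `level` and returns it at `level - 1`.
    return value * (level - 1) / level


def fifo_diffusion(num_frames, queue_len, seed=0):
    rng = random.Random(seed)
    # Each entry is [frame_id, value, noise_level]. Noise level rises from
    # 1 at the head to queue_len at the tail (the "diagonal").
    # Assumption: the queue is warm-started with noise scaled per level.
    queue = deque(
        [i, rng.gauss(0.0, 1.0) * (i + 1) / queue_len, i + 1]
        for i in range(queue_len)
    )
    next_id = queue_len
    outputs = []
    while len(outputs) < num_frames:
        # Diagonal denoising: every frame in the queue advances one level
        # in the same pass, each starting from its own noise level.
        for entry in queue:
            entry[1] = toy_denoise(entry[1], entry[2])
            entry[2] -= 1
        # Dequeue the head, which is now fully denoised (level 0).
        frame_id, value, level = queue.popleft()
        assert level == 0
        outputs.append((frame_id, value))
        # Enqueue a fresh random-noise frame at the tail (maximum level).
        queue.append([next_id, rng.gauss(0.0, 1.0), queue_len])
        next_id += 1
    return outputs
```

Calling `fifo_diffusion(10, 4)` emits ten frames from a queue of only four, which is the sense in which the output length is decoupled from the number of frames processed at once.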


FIFO-Diffusion: Generating Infinite Videos from Text without Training