We introduce Lumiere -- a text-to-video diffusion model designed for synthesizing videos that portray realistic, diverse and coherent motion -- a pivotal challenge in video synthesis. To this end, we introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once, through a single pass in the model. This is in contrast to existing video models which synthesize distant keyframes followed by temporal super-resolution -- an approach that inherently makes global temporal consistency difficult to achieve. By deploying both spatial and (importantly) temporal down- and up-sampling and leveraging a pre-trained text-to-image diffusion model, our model learns to directly generate a full-frame-rate, low-resolution video by processing it in multiple space-time scales. We demonstrate state-of-the-art text-to-video generation results, and show that our design easily facilitates a wide range of content creation tasks and video editing applications, including image-to-video, video inpainting, and stylized generation.
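
To make "temporal down- and up-sampling" concrete, the following is a minimal NumPy sketch of reducing a video along both its time and space axes and then restoring its shape. It is not the Lumiere implementation: the function names, the `(T, H, W, C)` layout, the factor of 2, and the choice of average pooling and nearest-neighbour upsampling are all assumptions made for illustration, whereas the model described above uses learned layers inside a U-Net.

```python
import numpy as np

def downsample_space_time(video, t=2, s=2):
    """Average-pool a (T, H, W, C) video by factor t in time and s in space.

    Hypothetical helper for illustration; a real model would use learned layers.
    """
    T, H, W, C = video.shape
    assert T % t == 0 and H % s == 0 and W % s == 0, "sizes must divide evenly"
    # Split each pooled axis into (coarse index, offset within block).
    blocks = video.reshape(T // t, t, H // s, s, W // s, s, C)
    # Average over the within-block offsets on the time, height and width axes.
    return blocks.mean(axis=(1, 3, 5))

def upsample_space_time(video, t=2, s=2):
    """Nearest-neighbour upsampling that restores the original shape."""
    return video.repeat(t, axis=0).repeat(s, axis=1).repeat(s, axis=2)

# A random stand-in for a 16-frame, 32x32 RGB clip.
video = np.random.default_rng(0).random((16, 32, 32, 3))
coarse = downsample_space_time(video)    # fewer frames AND fewer pixels
restored = upsample_space_time(coarse)   # back to the full frame count and size
print(video.shape, coarse.shape, restored.shape)
```

The point of the sketch is only that the coarse representation has fewer frames as well as fewer pixels, so computation at that scale covers the whole clip at once rather than a set of distant keyframes.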

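Lumiere is a text-to-video diffusion model for synthesizing videos that portray realistic, diverse, and coherent motion, a pivotal challenge in video synthesis. By introducing a Space-Time U-Net architecture, we generate the entire temporal duration of the video at once. Compared with existing video models that synthesize keyframes followed by temporal super-resolution, this design makes global temporal consistency easier to achieve. We demonstrate state-of-the-art text-to-video generation results and show that the design readily supports a wide range of content creation tasks and video editing applications, including image-to-video, video inpainting, and stylized generation.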

Lumiere: A Space-Time Diffusion Model for Video Generation