We introduce MeDM, an efficient and effective method that uses pre-trained image Diffusion Models for video-to-video translation with consistent temporal flow. The framework can render videos from scene position information, such as a normal G-buffer, or perform text-guided editing on videos captured in real-world scenarios. We employ explicit optical flows to construct a practical coding that enforces physical constraints on generated frames and mediates independent frame-wise scores. With this coding, maintaining temporal consistency in the generated videos can be framed as an optimization problem with a closed-form solution. To ensure compatibility with Stable Diffusion, we also propose a workaround for modifying observed-space scores in latent-space Diffusion Models. Notably, MeDM requires neither fine-tuning nor test-time optimization of the Diffusion Models. Extensive qualitative, quantitative, and subjective experiments on various benchmarks demonstrate the effectiveness and superiority of the proposed approach.
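One way to read the closed-form claim is as a least-squares problem: if optical flow assigns every pixel in every frame to a shared track, then the values that best agree with the independent per-frame estimates, while being identical along each track, are the per-track means. The sketch below illustrates that idea on toy data only. The function name `harmonize`, the `track_ids` array, and the assumption that flow has already been converted into integer track IDs are all hypothetical and are not taken from the paper.

```python
import numpy as np

def harmonize(values, track_ids):
    """Replace each pixel value with the mean over all pixels sharing its track.

    values:    float array of per-frame pixel estimates, any shape.
    track_ids: non-negative int array of the same shape (hypothetical input;
               assumed to be derived from optical flow beforehand).
    Minimizing sum_i (y[track_ids[i]] - values[i])**2 over y gives the
    per-track mean, which is the closed-form solution used here.
    """
    values = np.asarray(values, dtype=np.float64)
    flat_vals = values.ravel()
    flat_ids = np.asarray(track_ids).ravel()
    n_tracks = int(flat_ids.max()) + 1
    sums = np.bincount(flat_ids, weights=flat_vals, minlength=n_tracks)
    counts = np.bincount(flat_ids, minlength=n_tracks)
    means = sums / np.maximum(counts, 1)  # guard against unused track IDs
    return means[flat_ids].reshape(values.shape)

# Toy example: 2 frames with 3 pixels each. Tracks 0 and 1 appear in both
# frames; tracks 2 and 3 are each visible in only one frame (e.g. occlusion).
values = np.array([[1.0, 4.0, 7.0],
                   [3.0, 4.0, 9.0]])
track_ids = np.array([[0, 1, 2],
                      [0, 1, 3]])
result = harmonize(values, track_ids)
```

Pixels that share a track end up with the same value in every frame, while pixels visible in only one frame are left unchanged.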

MeDM: Mediating Image Diffusion Models for Video-to-Video Translation with Temporal Correspondence Guidance