Image-to-video generation, which aims to generate a video from a given reference image, has drawn great attention. Existing methods try to extend pre-trained text-guided image diffusion models into image-guided video generation models. Nevertheless, these methods often produce results with low fidelity or flickering over time, because their image guidance is shallow and their temporal consistency is poor. To tackle these problems, we propose DreamVideo, a high-fidelity image-to-video generation method that adds a frame retention branch to a pre-trained video diffusion model. Instead of integrating the reference image into the diffusion process at the semantic level, DreamVideo perceives the reference image through convolution layers and concatenates the resulting features with the noisy latents as model input. In this way, the details of the reference image are preserved to the greatest extent. In addition, by incorporating double-condition classifier-free guidance, a single image can be directed to videos of different actions by providing varying text prompts. This has significant implications for controllable video generation and holds broad application prospects. We conduct comprehensive experiments on a public dataset; both quantitative and qualitative results indicate that our method outperforms the state-of-the-art method. In terms of fidelity in particular, our model shows strong image retention ability and achieves a strong FVD score on UCF101 compared to other image-to-video models. Precise control can also be achieved by giving different text prompts. Further details and comprehensive results of our model are available at https://anonymous0769.github.io/DreamVideo/.
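To make the idea of double-condition classifier-free guidance concrete, the following is a minimal sketch and not the DreamVideo implementation. The abstract does not give the guidance formula, so the sketch assumes one common two-condition formulation (the one used by InstructPix2Pix), in which the denoiser is evaluated three times per step: with no condition, with the image only, and with both image and text. The function name, argument names, and the two scale parameters are hypothetical.

```python
from typing import List, Sequence


def double_condition_cfg(
    eps_uncond: Sequence[float],
    eps_image: Sequence[float],
    eps_image_text: Sequence[float],
    image_scale: float,
    text_scale: float,
) -> List[float]:
    """Combine three noise predictions into one guided prediction.

    ASSUMPTION: this follows the InstructPix2Pix-style formula
        eps = eps_uncond
              + image_scale * (eps_image - eps_uncond)
              + text_scale  * (eps_image_text - eps_image)
    which may differ from the formula used in the paper.
    Predictions are flat lists of floats here to keep the sketch
    dependency-free; a real model would use tensors.
    """
    if not (len(eps_uncond) == len(eps_image) == len(eps_image_text)):
        raise ValueError("all noise predictions must have the same length")
    guided = []
    for u, i, it in zip(eps_uncond, eps_image, eps_image_text):
        # Push away from the unconditional prediction toward the image
        # condition, then further toward the text condition.
        guided.append(u + image_scale * (i - u) + text_scale * (it - i))
    return guided


# Toy predictions for a two-element "latent".
uncond = [0.0, 1.0]
image_only = [1.0, 1.0]
image_and_text = [3.0, 0.0]

# With both scales equal to 1, guidance reduces to the fully
# conditioned prediction.
plain = double_condition_cfg(uncond, image_only, image_and_text, 1.0, 1.0)

# Raising text_scale strengthens the influence of the prompt.
strong_text = double_condition_cfg(uncond, image_only, image_and_text, 1.0, 4.0)
```

Under this assumed formulation, the two scales can be tuned separately: a larger image scale favours fidelity to the reference image, while a larger text scale favours the motion described by the prompt.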

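We propose DreamVideo, a high-fidelity image-to-video generation method that addresses the limitations of existing methods by adding a frame retention branch to a pre-trained video diffusion model. The method perceives the reference image through convolution layers and concatenates the features with the noisy latents as model input. In addition, by incorporating double-condition classifier-free guidance, a single image can be directed to videos of different actions by providing different text prompts, which gives video generation precise control. Comprehensive experiments show that our method performs well on a public dataset, outperforms existing methods in both quantitative and qualitative results, and shows strong image retention ability and a strong FVD score on UCF101 compared to other image-to-video models. Further details and comprehensive results are presented in the paper.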

DreamVideo: High-Fidelity Image-to-Video Generation with Image Retention and Text Guidance