Diffusion models have achieved significant success in image and video
generation. This motivates a growing interest in video editing tasks, where
videos are edited according to provided text descriptions. However, most
existing approaches only focus on video editing for short clips and rely on
time-consuming tuning or inference. We are the first to propose Video
Instruction Diffusion (VIDiff), a unified foundation model designed for a wide
range of video tasks. These tasks encompass both understanding tasks (such as
language-guided video object segmentation) and generative tasks (video editing
and enhancement). Given user instructions, our model can edit and translate
input videos into the desired results within seconds. Moreover, we design an iterative
auto-regressive method to ensure consistency in editing and enhancing long
videos. We provide convincing generative results for diverse input videos and
written instructions, both qualitatively and quantitatively. More examples can
be found at our website.

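We propose Video Instruction Diffusion (VIDiff), a unified foundation model
designed for a wide range of video tasks, including understanding tasks (such
as language-guided video object segmentation) and generative tasks (video
editing and enhancement). Given user instructions, our model can edit and
translate input videos into the desired results within seconds, and we design
an iterative auto-regressive method to ensure consistent editing and
enhancement of long videos. We provide convincing generative results for
diverse input videos and written instructions, both qualitatively and
quantitatively.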