Text-guided video prediction (TVP) involves predicting the motion of future
frames from the initial frame according to an instruction, which has wide
applications in virtual reality, robotics, and content creation. Previous TVP
methods make significant breakthroughs by adapting Stable Diffusion for this
task. However, they struggle with frame consistency and temporal stability
primarily due to the limited scale of video datasets. We observe that
pretrained Image2Video diffusion models possess good priors for video dynamics
but they lack textual control. Hence, transferring Image2Video models to
leverage their video dynamic priors while injecting instruction control to
generate controllable videos is both a meaningful and challenging task. To
achieve this, we introduce the Multi-Modal Large Language Model (MLLM) to
predict future video states based on initial frames and text instructions. More
specifically, we design a dual query transformer (DQFormer) architecture, which
integrates the instructions and frames into the conditional embeddings for
future frame prediction. Additionally, we develop Long-Short Term Temporal
Adapters and Spatial Adapters that can quickly transfer general video diffusion
models to specific scenarios with minimal training costs. Experimental results
show that our method significantly outperforms state-of-the-art techniques on
four datasets: Something Something V2, Epic Kitchen-100, Bridge Data, and
UCF-101. Notably, AID achieves 91.2% and 55.5% FVD improvements on Bridge and
SSv2 respectively, demonstrating its effectiveness in various domains. More
examples can be found at our website this https URL

基于文本和初始帧，我们引入多模态大型语言模型 (MLLM) 来预测未来的视频状态。通过设计双查询 Transformer (DQFormer) 架构，并利用长短期时间适配器和空间适配器来快速转换通用视频扩散模型，我们的方法在四个数据集上明显优于现有技术，证明了其在不同领域的有效性。