We introduce $\textit{InteractiveVideo}$, a user-centric framework for video
generation. Unlike traditional generative approaches that operate on
user-provided images or text, our framework is designed for dynamic
interaction, allowing users to instruct the generative model through various
intuitive mechanisms throughout the generation process, such as text and image
prompts, painting, and drag-and-drop. We propose a Synergistic Multimodal
Instruction mechanism, designed to seamlessly integrate users' multimodal
instructions into generative models, thus facilitating a cooperative and
responsive interaction between user inputs and the generative process. This
approach enables iterative and fine-grained refinement of the generation result
through precise and effective user instructions. With
$\textit{InteractiveVideo}$, users are given the flexibility to meticulously
tailor key aspects of a video. They can paint the reference image, edit
semantics, and adjust video motions until their requirements are fully met.
Code, models, and demo are available at
this https URL

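We introduce InteractiveVideo, a user-centric framework for video generation. Through dynamic interaction, it lets users instruct the generative model throughout the generation process via various intuitive mechanisms, such as text, image prompts, painting, and drag-and-drop. We propose a Synergistic Multimodal Instruction mechanism designed to seamlessly integrate users' multimodal instructions into generative models, facilitating cooperative and responsive interaction between user inputs and the generation process, so that the generation result can be refined iteratively and at fine granularity through precise and effective user instructions. With InteractiveVideo, users can tailor key aspects of a video in detail, such as painting the reference image, editing semantics, and adjusting video motion, until their requirements are met. Links to the code, models, and demo are also provided.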

InteractiveVideo: User-Centric Controllable Video Generation with Synergistic Multimodal Instructions

Large language models with instruction-following abilities have
revolutionized the field of artificial intelligence. These models show
exceptional generalizability to tackle various real-world tasks through their
natural language interfaces. However, their performance heavily relies on
high-quality exemplar data, which is often difficult to obtain. This challenge
is further exacerbated when it comes to multimodal instruction following. We
introduce TextBind, an almost annotation-free framework for empowering larger
language models with multi-turn interleaved multimodal
instruction-following capabilities. Our approach requires only image-caption
pairs and generates multi-turn multimodal instruction-response conversations
from a language model. We release our dataset, model, and demo to foster future
research in the area of multimodal instruction following.

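We introduce TextBind, an almost annotation-free framework that empowers larger language models with multi-turn interleaved multimodal instruction-following capabilities. It uses only image-caption pairs to generate multi-turn multimodal instruction-response conversations, with the aim of fostering future research on multimodal instruction following.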