We present MOFA-Video, a controllable image animation method that generates a video from a given image using various additional control signals (such as human landmark references, manual trajectories, or even another provided video) or their combinations. This differs from previous methods, which either work only in a specific motion domain or show weak control abilities with a diffusion prior. To achieve this, we design several domain-aware motion field adapters (\ie, MOFA-Adapters) to control the generated motion in the video generation pipeline. Each MOFA-Adapter accounts for the temporal motion consistency of the video: it first generates dense motion flow from the given sparse control conditions, and then warps the multi-scale features of the given image into a guidance feature for stable video diffusion generation. We simply train two motion adapters separately, one for manual trajectories and one for human landmarks, since both contain only sparse control information. After training, MOFA-Adapters from different domains can also work together for more controllable video generation.
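The flow-guided feature step can be illustrated with a minimal sketch. It assumes the dense motion flow is applied to image features by backward bilinear warping, and that coarser feature scales reuse the same flow after subsampling and rescaling. The function names `warp_bilinear` and `downscale_flow`, the single-channel list-of-rows layout, and the border clamping are all hypothetical choices made for illustration, not the authors' implementation.

```python
import math


def warp_bilinear(feat, flow_x, flow_y):
    """Backward-warp a 2D feature map with a dense flow field.

    feat, flow_x, flow_y are lists of rows with identical height and width.
    out[y][x] samples feat at (y + flow_y[y][x], x + flow_x[y][x]) using
    bilinear interpolation; sample positions are clamped to the border.
    """
    h, w = len(feat), len(feat[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Source position, clamped so that it stays inside the map.
            sx = min(max(x + flow_x[y][x], 0.0), w - 1.0)
            sy = min(max(y + flow_y[y][x], 0.0), h - 1.0)
            x0, y0 = int(math.floor(sx)), int(math.floor(sy))
            x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
            wx, wy = sx - x0, sy - y0
            # Interpolate along x on the two neighboring rows, then along y.
            top = feat[y0][x0] * (1 - wx) + feat[y0][x1] * wx
            bot = feat[y1][x0] * (1 - wx) + feat[y1][x1] * wx
            out[y][x] = top * (1 - wy) + bot * wy
    return out


def downscale_flow(flow, factor):
    """Subsample one flow component by an integer factor (assumed scheme)
    and rescale its displacements to the coarser grid."""
    return [[v / factor for v in row[::factor]] for row in flow[::factor]]


# Toy example: a 4x4 feature map and a flow that samples one pixel to the right.
feat = [[float(4 * r + c) for c in range(4)] for r in range(4)]
flow_x = [[1.0] * 4 for _ in range(4)]
flow_y = [[0.0] * 4 for _ in range(4)]
warped = warp_bilinear(feat, flow_x, flow_y)
coarse_flow_x = downscale_flow(flow_x, 2)
```

In this toy case each output pixel takes the value of its right-hand neighbor, with the last column repeated because of the border clamp. A real pipeline would apply the same operation per channel at each feature scale, typically with a batched tensor operation rather than Python loops.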

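MOFA-Video generates a video from a given image using various additional control signals (such as human landmark references, manual trajectories, and another provided video) or their combinations. Unlike previous methods, which work only within a specific motion domain or show weak control abilities, MOFA-Video relies on several domain-aware motion adapters (\ie, MOFA-Adapters) that we design to control the motion generated in the video generation pipeline. For the MOFA-Adapters, we first consider the temporal motion consistency of the video and generate dense motion flow from the given sparse control conditions; the multi-scale features of the given image are then warped into a guidance feature for stable video diffusion generation. We train two motion adapters separately, one for manual trajectories and one for human landmarks, since both contain sparse information about the control. After training, MOFA-Adapters from different domains can also work together for more controllable video generation.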

MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model