This report presents MagicAvatar, a framework for multimodal video generation
and animation of human avatars. Unlike most existing methods that generate
avatar-centric videos directly from multimodal inputs (e.g., text prompts),
MagicAvatar explicitly disentangles avatar video generation into two stages:
(1) multimodal-to-motion and (2) motion-to-video generation. The first stage
translates the multimodal inputs into motion/control signals (e.g., human
pose, depth, DensePose), while the second stage generates avatar-centric video
guided by these motion signals. Additionally, MagicAvatar supports avatar
animation: given a few images of the target person, it animates that identity
according to the motion derived from the first stage. We demonstrate the
flexibility of MagicAvatar through various applications, including text-guided
and video-guided avatar generation, as well as multimodal avatar animation.
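
The two-stage split described above can be sketched as a simple composition of two callables. This is a minimal illustration only, not the MagicAvatar implementation or API: the names `TwoStagePipeline`, `toy_to_motion`, and `toy_to_video` are hypothetical, and both stages are toy stand-ins for learned models.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical types: a motion signal is one per-frame control vector
# (e.g. pose keypoints); a frame is an opaque placeholder string.
MotionSignal = List[float]
Frame = str


@dataclass
class TwoStagePipeline:
    """Illustrative composition of the two stages (assumed structure)."""
    to_motion: Callable[[str, int], List[MotionSignal]]             # stage 1
    to_video: Callable[[Sequence[MotionSignal], str], List[Frame]]  # stage 2

    def generate(self, prompt: str, identity: str, num_frames: int) -> List[Frame]:
        # Stage 1: multimodal input -> motion/control signals.
        motion = self.to_motion(prompt, num_frames)
        # Stage 2: motion signals (plus a target identity) -> video frames.
        return self.to_video(motion, identity)


def toy_to_motion(prompt: str, num_frames: int) -> List[MotionSignal]:
    # Stand-in for a multimodal-to-motion model: emits a linear "pose" ramp.
    return [[t / max(num_frames - 1, 1)] for t in range(num_frames)]


def toy_to_video(motion: Sequence[MotionSignal], identity: str) -> List[Frame]:
    # Stand-in for a motion-conditioned video generator.
    return [f"{identity}@pose={m[0]:.2f}" for m in motion]


pipeline = TwoStagePipeline(toy_to_motion, toy_to_video)
frames = pipeline.generate("a person waving", identity="subject_A", num_frames=3)
```

Because the stages only meet at the motion signal, either one could in principle be swapped (for example, motion extracted from a reference video instead of a text prompt) without touching the other.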

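MagicAvatar is a framework for multimodal video generation and human avatar animation. It explicitly splits generation into two stages, multimodal-to-motion and motion-to-video, and, given a few images of a person, can animate that specific identity according to the motion produced by the first stage.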

MagicAvatar: Multimodal Avatar Generation and Animation

Most methods for conditional video synthesis use a single modality as the
condition. This comes with major limitations. For example, it is problematic
for a model conditioned on an image to generate a specific motion trajectory
desired by the user since there is no means to provide motion information.
Conversely, language information can describe the desired motion but does not
precisely define the content of the video. This work presents a multimodal
video generation framework that benefits from text and images provided jointly
or separately. We leverage the recent progress in quantized representations for
videos and apply a bidirectional transformer with multiple modalities as inputs
to predict a discrete video representation. To improve video quality and
consistency, we propose a new video token trained with self-learning and an
improved mask-prediction algorithm for sampling video tokens. We introduce text
augmentation to improve the robustness of the textual representation and
diversity of generated videos. Our framework can incorporate various visual
modalities, such as segmentation masks, drawings, and partially occluded
images. It can generate much longer sequences than those used for training.
In addition, our model can extract visual information as suggested by the text
prompt, e.g., "an object in image one is moving northeast", and generate
corresponding videos. We run evaluations on three public datasets and a newly
collected dataset labeled with facial attributes, achieving state-of-the-art
generation results on all four.
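
As a rough illustration of sampling discrete tokens by mask prediction, the sketch below runs a generic confidence-based iterative decoding loop. It is not the improved algorithm proposed in this work, whose details the abstract does not give. The cosine schedule, the `MASK` id, and `toy_predictor` (a random stand-in for the bidirectional transformer) are all assumptions made for the example.

```python
import math
import random

MASK = -1  # hypothetical id for the mask token


def toy_predictor(tokens, vocab_size, rng):
    """Stand-in for a bidirectional transformer.

    Returns one (token_id, confidence) pair per position.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def mask_predict(length, vocab_size, steps, seed=0):
    rng = random.Random(seed)
    tokens = [MASK] * length  # start from a fully masked sequence
    for step in range(1, steps + 1):
        preds = toy_predictor(tokens, vocab_size, rng)
        # Assumed cosine schedule: positions left masked after this step.
        n_masked = int(length * math.cos(math.pi / 2 * step / steps))
        # Only positions that are still masked are candidates.
        masked_pos = [i for i, t in enumerate(tokens) if t == MASK]
        # Commit the most confident predictions; the rest stay masked.
        masked_pos.sort(key=lambda i: preds[i][1], reverse=True)
        n_commit = len(masked_pos) - n_masked
        for i in masked_pos[:n_commit]:
            tokens[i] = preds[i][0]
    return tokens


result = mask_predict(length=16, vocab_size=100, steps=4)
```

In this variant, tokens committed in an earlier step are never revisited; other mask-predict schemes re-mask low-confidence tokens across the whole sequence at every step.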

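A multimodal video generation framework that conditions on text and images, given jointly or separately, and uses a bidirectional transformer over multiple input modalities to predict a discrete video representation. It introduces a new video token, an improved mask-prediction algorithm for sampling video tokens, and text augmentation; supports visual modalities such as segmentation masks, drawings, and partially occluded images; can generate videos that follow a text prompt; and reports state-of-the-art generation results on four datasets.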