We introduce animated stickers, a video diffusion model that generates an animation conditioned on a text prompt and a static sticker image. Our model is built on top of the state-of-the-art Emu text-to-image model, with the addition of temporal layers to model motion. Due to the domain gap, i.e. differences in visual and motion style, a model that performs well on natural videos can no longer generate vivid videos when applied to stickers. To bridge this gap, we employ a two-stage finetuning pipeline: first with weakly in-domain data, followed by a human-in-the-loop (HITL) strategy that we term ensemble-of-teachers, which distills the best qualities of multiple teachers into a smaller student model. We show that this strategy allows us to specifically target improvements to motion quality while maintaining the style of the static image. With inference optimizations, our model is able to generate an eight-frame video with high-quality, interesting, and relevant motion in under one second.
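
The ensemble-of-teachers idea can be illustrated with a minimal sketch of the data-curation step, under stated assumptions. The abstract does not specify how teacher outputs are selected, so everything named here (`Sample`, `curate_student_data`, the `reviewer` callable standing in for human annotators, and the string placeholders for images and videos) is hypothetical and not the authors' implementation. The sketch only shows the general pattern: several teachers each propose an animation, a human-in-the-loop filter keeps the good ones, and the kept samples form the finetuning set for a smaller student.

```python
from typing import Callable, Dict, Iterable, List, NamedTuple, Tuple


class Sample(NamedTuple):
    prompt: str   # text prompt
    image: str    # placeholder for the static sticker image
    video: str    # placeholder for the generated animation
    teacher: str  # name of the teacher that produced the video


# Hypothetical interfaces: a teacher maps (prompt, image) to a video, and a
# reviewer stands in for the human accept/reject decision (the HITL step).
Teacher = Callable[[str, str], str]
Reviewer = Callable[[Sample], bool]


def curate_student_data(
    inputs: Iterable[Tuple[str, str]],
    teachers: Dict[str, Teacher],
    reviewer: Reviewer,
) -> List[Sample]:
    """Collect teacher outputs that pass review, to finetune a student on."""
    kept: List[Sample] = []
    for prompt, image in inputs:
        for name, teacher in teachers.items():
            candidate = Sample(prompt, image, teacher(prompt, image), name)
            # Only candidates accepted by the reviewer reach the student set.
            if reviewer(candidate):
                kept.append(candidate)
    return kept
```

In this sketch the student would then be finetuned on the returned samples; that training step is omitted because the abstract gives no details about it.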

Animated Stickers: Bringing Stickers to Life with Video Diffusion