This paper presents NÜWA, a unified multimodal pre-trained model that can generate new visual data or manipulate existing visual data (i.e., images and videos) for a variety of visual synthesis tasks. To cover language, image, and video at the same time across different scenarios, we design a 3D transformer encoder-decoder framework that handles videos as 3D data and also adapts to texts and images as 1D and 2D data, respectively. We also propose a 3D Nearby Attention (3DNA) mechanism that takes the nature of visual data into account and reduces computational complexity. We evaluate NÜWA on 8 downstream tasks. Compared to several strong baselines, NÜWA achieves state-of-the-art results on text-to-image generation, text-to-video generation, video prediction, and other tasks. It also shows surprisingly good zero-shot capabilities on text-guided image and video manipulation tasks. The project repository is https://github.com/microsoft/NUWA.
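
The abstract does not spell out how 3D Nearby Attention is computed, so the following is only a minimal sketch of the general idea of locality-restricted attention on a 3D grid, not the authors' implementation. It assumes a hypothetical `extent` parameter (a half-width per axis), uses the input itself as queries, keys, and values with no learned projections, and assumes that text and images are lifted to the shared 3D layout by adding singleton axes (the helper `as_3d` is likewise hypothetical).

```python
import numpy as np

def nearby_attention_3d(x, extent=(1, 1, 1)):
    """Self-attention over an (H, W, F, C) grid restricted to a local 3D window.

    `extent` = (eh, ew, ef) is an assumed half-width per axis: position
    (i, j, k) attends only to positions within +/- extent along each axis.
    Queries, keys, and values are all `x` itself (no learned projections).
    """
    H, W, F, C = x.shape
    eh, ew, ef = extent
    out = np.empty_like(x, dtype=float)
    for i in range(H):
        for j in range(W):
            for k in range(F):
                # Gather the neighbourhood, clipped at the grid boundary.
                nb = x[max(i - eh, 0):i + eh + 1,
                       max(j - ew, 0):j + ew + 1,
                       max(k - ef, 0):k + ef + 1].reshape(-1, C)
                # Scaled dot-product scores against the nearby positions only.
                scores = nb @ x[i, j, k] / np.sqrt(C)
                weights = np.exp(scores - scores.max())
                weights /= weights.sum()
                out[i, j, k] = weights @ nb
    return out

def as_3d(tokens, kind):
    """Lift 1D text or 2D image features to the shared (H, W, F, C) layout (assumed)."""
    if kind == "text":   # (L, C) -> (1, 1, L, C)
        return tokens[None, None, :, :]
    if kind == "image":  # (H, W, C) -> (H, W, 1, C)
        return tokens[:, :, None, :]
    if kind == "video":  # already (H, W, F, C)
        return tokens
    raise ValueError(f"unknown kind: {kind}")
```

In this sketch each position attends to at most (2·eh+1)(2·ew+1)(2·ef+1) neighbours instead of all H·W·F positions, which is where the reduction in cost comes from; an extent that covers the whole grid recovers ordinary full attention.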

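This paper proposes NÜWA, a unified multimodal pre-trained model that can generate new visual data or manipulate existing visual data (i.e., images and videos) for a variety of visual synthesis tasks. Evaluated on 8 downstream tasks against strong baselines, NÜWA achieves state-of-the-art results on text-to-image generation, text-to-video generation, video prediction, and other tasks. It also shows surprisingly good zero-shot capabilities on text-guided image and video manipulation tasks.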

NÜWA: Visual Synthesis Pre-training for Neural Visual World Creation