We introduce Vidu, a high-performance text-to-video generator that is capable
of producing 1080p videos of up to 16 seconds in a single generation. Vidu is a
diffusion model with U-ViT as its backbone, which unlocks the scalability and
the capability for handling long videos. Vidu exhibits strong coherence and
dynamism, and is capable of generating both realistic and imaginative videos,
as well as understanding some professional photography techniques, on par with
Sora, the most powerful reported text-to-video generator. Finally, we perform
initial experiments on other controllable video generation tasks, including
canny-to-video generation, video prediction, and subject-driven generation,
which demonstrate promising results.

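Vidu is a high-performance text-to-video generator that uses U-ViT as its backbone and can produce 1080p videos of up to 16 seconds in a single generation. Vidu shows strong coherence and dynamism, can generate both realistic and imaginative videos, and is on par with Sora in its grasp of some professional photography techniques. Finally, we also carried out preliminary experiments on other controllable video generation tasks, including canny-to-video generation, video prediction, and subject-driven generation, with promising results.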

Vidu: a Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models

This paper strives for image editing via generative models. Flow Matching is
an emerging generative modeling technique that offers the advantage of simple
and efficient training. Simultaneously, a new transformer-based U-ViT has
recently been proposed to replace the commonly used UNet for better scalability
and performance in generative modeling. Hence, Flow Matching with a transformer
backbone offers the potential for scalable and high-quality generative
modeling, but its latent structure and editing ability are as yet unknown.
We therefore adopt this setting and explore how to edit images through latent
space manipulation. We introduce an editing space, which we call $u$-space,
that can be manipulated in a controllable, accumulative, and composable manner.
Additionally, we propose a tailored sampling solution to enable sampling with
the more efficient adaptive step-size ODE solvers. Lastly, we put forth a
straightforward yet powerful method for achieving fine-grained and nuanced
editing using text prompts. Our framework is simple and efficient, while
remaining highly effective at editing images and preserving the essence of the
original content. Our code will be publicly available at this https URL
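
To make the sampling idea concrete, the sketch below integrates a Flow Matching ODE with an adaptive step-size solver. It is only a toy illustration under stated assumptions, not the tailored sampling solution of the paper: the velocity field is a closed-form one-dimensional Gaussian example rather than a trained U-ViT, the solver is a basic embedded Heun-Euler pair chosen for brevity, and the names `MU`, `S`, `velocity`, and `sample_adaptive` are hypothetical.

```python
import math

# Assumed toy setup: source N(0, 1), target N(MU, S^2), both one-dimensional.
MU, S = 2.0, 0.5


def velocity(t, x):
    """Closed-form marginal velocity for the linear path x_t = (1 - t) x0 + t x1."""
    var_t = (1.0 - t) ** 2 + (t * S) ** 2
    return MU + (t * S * S - (1.0 - t)) / var_t * (x - t * MU)


def sample_adaptive(x0, rtol=1e-6, atol=1e-9):
    """Integrate dx/dt = velocity(t, x) from t=0 to t=1 with adaptive steps."""
    t, x, h, steps = 0.0, x0, 0.1, 0
    while t < 1.0:
        h = min(h, 1.0 - t)                  # never step past t = 1
        k1 = velocity(t, x)
        k2 = velocity(t + h, x + h * k1)
        x_new = x + 0.5 * h * (k1 + k2)      # second-order Heun update
        err = abs(0.5 * h * (k2 - k1))       # gap to the first-order Euler update
        tol = atol + rtol * max(abs(x), abs(x_new))
        if err <= tol:                       # accept the step
            t, x, steps = t + h, x_new, steps + 1
        # Grow or shrink the step size from the error estimate.
        h *= min(5.0, max(0.2, 0.9 * math.sqrt(tol / err))) if err > 0 else 5.0
    return x, steps


x1, n_steps = sample_adaptive(1.0)
print(x1, n_steps)
```

For this toy field the exact flow maps a source sample `x0` to `MU + S * x0`, so the printed value should be close to 2.5. The solver takes small steps where the velocity changes quickly and larger ones elsewhere, which is the efficiency argument for adaptive step-size solvers.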

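By combining Flow Matching with a transformer backbone (U-ViT), the paper presents a simple and efficient image editing method, builds on a setting that promises scalable and high-quality generative modeling, and explores its latent structure and editing ability. An editing space called $u$-space, together with a tailored sampling solution for adaptive step-size ODE solvers, enables controllable, accumulative, and composable image editing. In addition, a straightforward yet powerful method based on text prompts achieves fine-grained and nuanced editing. The framework is simple and efficient while preserving the essence of the original image content.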