3D asset generation is getting massive amounts of attention, inspired by the recent success of text-guided 2D content creation. Existing text-to-3D methods use pretrained text-to-image diffusion models in an optimization problem or fine-tune them on synthetic data, which often results in non-photorealistic 3D objects without backgrounds. In this paper, we present a method that leverages pretrained text-to-image models as a prior, and learn to generate multi-view images in a single denoising process from real-world data. Concretely, we propose to integrate 3D volume-rendering and cross-frame-attention layers into each block of the existing U-Net network of the text-to-image model. Moreover, we design an autoregressive generation that renders more 3D-consistent images at any viewpoint. We train our model on real-world datasets of objects and showcase its capabilities to generate instances with a variety of high-quality shapes and textures in authentic surroundings. Compared to the existing methods, the results generated by our method are consistent, and have favorable visual quality (-30% FID, -37% KID).

本文提出一种新的方法，利用预训练的文字转图像模型作为先验知识，从真实世界数据中的单个去噪过程中生成多视角图像，并且通过在现有 U-Net 网络的每个块中整合 3D 体渲染和跨帧注意力层，设计出自回归生成方法，在任意视点上呈现更具一致性的 3D 图像。与现有方法相比，我们的方法生成的结果是一致的，并且具有优秀的视觉质量（FID降低30%，KID降低37%）。

ViewDiff：利用文本到图像模型的3D一致图像生成