We present DIRECT-3D, a diffusion-based 3D generative model for creating high-quality 3D assets (represented by Neural Radiance Fields) from text prompts. Unlike recent 3D generative models that rely on clean and well-aligned 3D data, which limits them to single-class or few-class generation, our model is trained directly on extensive noisy and unaligned "in-the-wild" 3D assets, mitigating the key challenge in large-scale 3D generation, namely data scarcity. In particular, DIRECT-3D is a tri-plane diffusion model that integrates two innovations: 1) a novel learning framework in which noisy data are filtered and aligned automatically during training. Specifically, after an initial warm-up phase on a small set of clean data, an iterative optimization is introduced in the diffusion process to explicitly estimate the 3D pose of objects and select beneficial data based on conditional density. 2) An efficient 3D representation, achieved by disentangling object geometry and color features with two separate conditional diffusion models that are optimized hierarchically. Given a text prompt, our model generates high-quality, high-resolution, realistic, and complex 3D objects with accurate geometric details in seconds. We achieve state-of-the-art performance in both single-class generation and text-to-3D generation. We also demonstrate that DIRECT-3D can serve as a useful 3D geometric prior of objects, for example to alleviate the well-known Janus problem in 2D-lifting methods such as DreamFusion. The code and models are available for research purposes at: https://github.com/qihao067/direct3d.
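
The alignment-and-filtering idea above can be illustrated with a toy sketch. This is not the authors' implementation: it assumes 2D point sets instead of tri-plane NeRFs, a fixed template in place of the warmed-up diffusion model, and a mean squared distance (`proxy_loss`, a hypothetical helper) as a stand-in for the negative log conditional density. For each sample it searches a small set of candidate poses, keeps the lowest-loss pose, and discards the sample if even that pose scores above a threshold.

```python
import math

def rotate(points, angle):
    """Rotate 2D points about the origin by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

def proxy_loss(points, template):
    """Mean squared distance to a template.

    Assumption: stands in for the diffusion model's denoising loss,
    i.e. a proxy for the negative log conditional density.
    """
    total = sum((px - tx) ** 2 + (py - ty) ** 2
                for (px, py), (tx, ty) in zip(points, template))
    return total / len(template)

def align_and_filter(dataset, template, candidate_angles, threshold):
    """Pick the lowest-loss pose per sample; keep the sample only if that loss is below threshold."""
    kept = []
    for points in dataset:
        losses = [(proxy_loss(rotate(points, a), template), a)
                  for a in candidate_angles]
        best_loss, best_angle = min(losses)
        if best_loss < threshold:
            kept.append((rotate(points, best_angle), best_angle))
    return kept

# Toy data: one clean but unaligned sample and one sample that fits no pose.
template = [(1.0, 0.0), (0.0, 2.0), (-1.0, 0.0)]
angles = [k * math.pi / 2 for k in range(4)]
dataset = [
    rotate(template, -math.pi / 2),          # clean, rotated by -90 degrees
    [(5.0, 5.0), (-4.0, 3.0), (2.0, -6.0)],  # noisy, should be filtered out
]
kept = align_and_filter(dataset, template, angles, threshold=0.1)
```

In this toy setting the first sample is kept with a recovered pose of +90 degrees and the second is discarded. In an iterative scheme of the kind the abstract describes, the retained and aligned samples would then be used to continue training before the next round of pose estimation and selection.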

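DIRECT-3D is a diffusion-based 3D generative model that creates high-quality 3D assets (represented by Neural Radiance Fields) from text prompts. It is trained directly on large-scale unaligned 3D assets while filtering and aligning noisy data: an iterative optimization in the diffusion process estimates the 3D pose of objects and selects beneficial data. Two conditional diffusion models disentangle object geometry and color features, giving an efficient 3D representation. The model generates high-quality, high-resolution, realistic, and complex 3D objects with accurate geometric details in seconds, and achieves state-of-the-art performance in both single-class generation and text-to-3D generation.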

DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data