We present Dual3D, a novel text-to-3D generation framework that generates high-quality 3D assets from text in only $1$ minute. The key component is a dual-mode multi-view latent diffusion model. Given noisy multi-view latents, the 2D mode denoises them efficiently with a single latent denoising network, while the 3D mode generates a tri-plane neural surface for consistent rendering-based denoising. Most modules for both modes are tuned from a pre-trained text-to-image latent diffusion model, which avoids the expensive cost of training from scratch. To overcome the high rendering cost during inference, we propose a dual-mode toggling inference strategy that uses the 3D mode for only $1/10$ of the denoising steps, successfully generating a 3D asset in just $10$ seconds without sacrificing quality. The texture of the 3D asset can be further enhanced by our efficient texture refinement process in a short time. Extensive experiments demonstrate that our method delivers state-of-the-art performance while significantly reducing generation time. Our project page is available at https://dual3d.github.io
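The dual-mode toggling idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `toggle_schedule` and `denoise`, the even spacing of 3D-mode steps, and the choice to end on a 3D-mode step are all assumptions made for illustration. The abstract only states that the 3D mode is used for $1/10$ of the denoising steps.

```python
from typing import Callable, List


def toggle_schedule(num_steps: int, every: int = 10) -> List[str]:
    """Return the mode ("2d" or "3d") used at each denoising step.

    Assumption: 3D-mode steps are evenly spaced, one per `every` steps,
    and aligned so that the final step runs in 3D mode.
    """
    if num_steps <= 0 or every <= 0:
        raise ValueError("num_steps and every must be positive")
    modes = []
    for step in range(num_steps):
        remaining = num_steps - 1 - step  # steps left after this one
        modes.append("3d" if remaining % every == 0 else "2d")
    return modes


def denoise(latents, num_steps: int,
            denoise_2d: Callable, denoise_3d: Callable, every: int = 10):
    """Run the sampling loop, toggling between two hypothetical denoisers.

    `denoise_2d` and `denoise_3d` are placeholders with the signature
    (latents, step) -> latents; they stand in for the cheap latent
    denoiser and the costlier rendering-based denoiser.
    """
    for step, mode in enumerate(toggle_schedule(num_steps, every)):
        fn = denoise_3d if mode == "3d" else denoise_2d
        latents = fn(latents, step)
    return latents
```

With `num_steps=50` and `every=10`, this schedule calls the rendering-based denoiser on 5 of the 50 steps, so most of the sampling cost stays in the cheaper 2D mode.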


Dual3D: Efficient and Consistent Text-to-3D Generation with Dual-mode Multi-view Latent Diffusion