Generating 3D scenes is a challenging open problem, which requires synthesizing plausible content that is fully consistent in 3D space. While recent methods such as neural radiance fields excel at view synthesis and 3D reconstruction, they cannot synthesize plausible details in unobserved regions since they lack a generative capability. Conversely, existing generative methods are typically not capable of reconstructing detailed, large-scale scenes in the wild, as they use limited-capacity 3D scene representations, require aligned camera poses, or rely on additional regularizers. In this work, we introduce the first diffusion model able to perform fast, detailed reconstruction and generation of real-world 3D scenes. To achieve this, we make three contributions. First, we introduce a new neural scene representation, IB-planes, that can efficiently and accurately represent large 3D scenes, dynamically allocating more capacity as needed to capture details visible in each image. Second, we propose a denoising-diffusion framework to learn a prior over this novel 3D scene representation, using only 2D images without the need for any additional supervision signal such as masks or depths. This supports 3D reconstruction and generation in a unified architecture. Third, we develop a principled approach to avoid trivial 3D solutions when integrating the image-based rendering with the diffusion model, by dropping out representations of some images. We evaluate the model on several challenging datasets of real and synthetic images, and demonstrate superior results on generation, novel view synthesis and 3D reconstruction.

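We introduce the first diffusion model able to perform fast, detailed reconstruction and generation of real-world 3D scenes, and make three contributions. First, we introduce a new neural scene representation, IB-planes, that can efficiently and accurately represent large 3D scenes, dynamically allocating more capacity as needed to capture the details visible in each image. Second, we propose a denoising-diffusion framework that learns a prior over this novel 3D scene representation using only 2D images, without additional supervision signals such as masks or depths, and thereby supports both 3D reconstruction and generation. Third, we develop a principled approach to avoid trivial 3D solutions when integrating image-based rendering with the diffusion model, namely dropping out the representations of some images. We evaluate the model on several challenging datasets of real and synthetic images and demonstrate superior results on generation, novel view synthesis, and 3D reconstruction.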

Denoising Diffusion via Image-Based Rendering