We present a latent diffusion model over 3D scenes that can be trained using only 2D image data. To achieve this, we first design an autoencoder that maps multi-view images to 3D Gaussian splats and simultaneously builds a compressed latent representation of these splats. We then train a multi-view diffusion model over the latent space to learn an efficient generative model. This pipeline requires neither object masks nor depths, and is suitable for complex scenes with arbitrary camera positions. We conduct careful experiments on two large-scale datasets of complex real-world scenes: MVImgNet and RealEstate10K. We show that our approach enables generating 3D scenes in as little as 0.2 seconds, either from scratch, from a single input view, or from sparse input views. It produces diverse and high-quality results while running an order of magnitude faster than non-latent diffusion models and earlier NeRF-based generative models.

Sampling 3D Gaussian scenes in seconds with latent diffusion models