We propose GS-LRM, a scalable large reconstruction model that predicts high-quality 3D Gaussian primitives from 2-4 sparse posed images in 0.23 seconds on a single A100 GPU. Our model features a very simple transformer-based architecture: we patchify the posed input images, pass the concatenated multi-view image tokens through a sequence of transformer blocks, and decode the final per-pixel Gaussian parameters directly from these tokens for differentiable rendering. Unlike previous LRMs, which can only reconstruct objects, GS-LRM predicts per-pixel Gaussians and therefore naturally handles scenes with large variations in scale and complexity. We show that our model works on both object and scene captures by training it on Objaverse and RealEstate10K, respectively. In both scenarios, the models outperform state-of-the-art baselines by a wide margin. We also demonstrate applications of our model in downstream 3D generation tasks. Our project webpage is available at: https://sai-bi.github.io/project/gs-lrm/
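
The pipeline described above (patchify, concatenate multi-view tokens, transformer blocks, per-pixel decode) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the single-head attention block, the random untrained weights, the 9 input channels (RGB plus an assumed 6-channel per-pixel pose encoding), and the 12 output channels per Gaussian are all assumptions chosen only to show how the tensor shapes flow.

```python
import numpy as np

def patchify(images, patch):
    """(V, H, W, C) images -> (V * H/patch * W/patch, patch*patch*C) tokens."""
    v, h, w, c = images.shape
    x = images.reshape(v, h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)  # gather the pixels of each patch together
    return x.reshape(v * (h // patch) * (w // patch), patch * patch * c)

def unpatchify(tokens, v, h, w, patch, c):
    """Inverse of patchify: tokens -> (V, H, W, C) per-pixel maps."""
    x = tokens.reshape(v, h // patch, w // patch, patch, patch, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)
    return x.reshape(v, h, w, c)

def self_attention_block(x, wq, wk, wv, wo):
    """Hypothetical single-head pre-norm self-attention with a residual path."""
    h = (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-6)
    q, k, val = h @ wq, h @ wk, h @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(-1, keepdims=True)  # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(-1, keepdims=True)
    return x + (attn @ val) @ wo

def predict_gaussians(images, patch=8, dim=64, depth=2, gauss_ch=12, seed=0):
    """Random-weight forward pass; every pixel of every view gets gauss_ch values."""
    rng = np.random.default_rng(seed)
    v, h, w, c = images.shape
    # Tokens from all views are concatenated into one sequence.
    tokens = patchify(images, patch) @ rng.normal(0, 0.02, (patch * patch * c, dim))
    for _ in range(depth):
        weights = [rng.normal(0, 0.02, (dim, dim)) for _ in range(4)]
        tokens = self_attention_block(tokens, *weights)
    # Linear decode of each token back to a patch of per-pixel Gaussian parameters.
    out = tokens @ rng.normal(0, 0.02, (dim, patch * patch * gauss_ch))
    return unpatchify(out, v, h, w, patch, gauss_ch)
```

Under these assumptions, calling `predict_gaussians` on an array of shape `(2, 16, 16, 9)` returns an array of shape `(2, 16, 16, 12)`, that is, one parameter vector per input pixel. Because the number of Gaussians scales with the input resolution rather than with a fixed object-centric volume, this per-pixel layout is what lets the same design cover both objects and scenes.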

GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting