We propose the first Large Reconstruction Model (LRM), which predicts the 3D model of an object from a single input image within just 5 seconds. In contrast to many previous methods, which are trained on small-scale datasets such as ShapeNet in a category-specific fashion, LRM adopts a highly scalable transformer-based architecture with 500 million learnable parameters to directly predict a neural radiance field (NeRF) from the input image. We train our model end-to-end on massive multi-view data containing around 1 million objects, including both synthetic renderings from Objaverse and real captures from MVImgNet. This combination of a high-capacity model and large-scale training data makes our model highly generalizable and able to produce high-quality 3D reconstructions from a variety of test inputs, including real-world in-the-wild captures and images created by generative models. Video demos and interactive 3D meshes can be found at https://yiconghong.me/LRM/.

LRM: Large Reconstruction Model for Single Image to 3D