We propose a Pose-Free Large Reconstruction Model (PF-LRM) that reconstructs a 3D object from a few unposed images, even with little visual overlap, while simultaneously estimating the relative camera poses, in ~1.3 seconds on a single A100 GPU. PF-LRM is a highly scalable method that uses self-attention blocks to exchange information between 3D object tokens and 2D image tokens; we predict a coarse point cloud for each view and then use a differentiable Perspective-n-Point (PnP) solver to obtain camera poses. Trained on a large collection of posed multi-view data covering ~1M objects, PF-LRM shows strong cross-dataset generalization and outperforms baseline methods by a large margin in pose prediction accuracy and 3D reconstruction quality on various unseen evaluation datasets. We also demonstrate our model's applicability to downstream text/image-to-3D tasks with fast feed-forward inference. Our project website is at: https://totoro97.github.io/pf-lrm .
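
To make the pose-recovery step concrete, the following is a minimal sketch of how a camera pose can be recovered from a point cloud whose 3D points are matched to known 2D pixel locations. It is not the differentiable solver used in PF-LRM: it assumes a plain, unweighted Direct Linear Transform (DLT) in NumPy, known intrinsics `K`, and at least six non-coplanar matches. The function name `solve_pnp_dlt` and all variable names are hypothetical.

```python
import numpy as np


def solve_pnp_dlt(points_3d, pixels, K):
    """Estimate a world-to-camera pose (R, t) from 3D-2D matches via DLT.

    Illustrative assumption, not the paper's solver.
    points_3d: (N, 3) points, pixels: (N, 2) pixel coordinates, K: (3, 3).
    Requires N >= 6 points that are not all coplanar.
    """
    n = points_3d.shape[0]
    # Pixels -> normalized camera coordinates: K^{-1} [u, v, 1]^T.
    pix_h = np.hstack([pixels, np.ones((n, 1))])
    xy = np.linalg.solve(K, pix_h.T).T[:, :2]
    X_h = np.hstack([points_3d, np.ones((n, 1))])

    # Each match gives two linear constraints on the 12 entries of P = [R | t]:
    #   p1 . X_h - x * (p3 . X_h) = 0
    #   p2 . X_h - y * (p3 . X_h) = 0
    A = np.zeros((2 * n, 12))
    A[0::2, 0:4] = X_h
    A[0::2, 8:12] = -xy[:, 0:1] * X_h
    A[1::2, 4:8] = X_h
    A[1::2, 8:12] = -xy[:, 1:2] * X_h

    # The solution is the right singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    P = Vt[-1].reshape(3, 4)

    # P is defined up to scale and sign; pick the sign giving det > 0.
    if np.linalg.det(P[:, :3]) < 0:
        P = -P

    # Project the 3x3 block onto the nearest rotation and undo the scale.
    U, S, Vt_r = np.linalg.svd(P[:, :3])
    R = U @ Vt_r
    t = P[:, 3] / S.mean()
    return R, t


if __name__ == "__main__":
    # Synthetic check: project random points with a known pose, then recover it.
    rng = np.random.default_rng(0)
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    R_true = Q * np.sign(np.linalg.det(Q))  # force a proper rotation
    t_true = np.array([0.1, -0.2, 4.0])
    K = np.array([[500.0, 0.0, 256.0], [0.0, 500.0, 256.0], [0.0, 0.0, 1.0]])
    pts = rng.uniform(-1.0, 1.0, size=(50, 3))
    cam = pts @ R_true.T + t_true
    proj = cam @ K.T
    pix = proj[:, :2] / proj[:, 2:3]
    R_est, t_est = solve_pnp_dlt(pts, pix, K)
    print("rotation error:", np.abs(R_est - R_true).max())
    print("translation error:", np.abs(t_est - t_true).max())
```

A solver meant for end-to-end training would instead be written in an autodiff framework so that gradients can flow back to the predicted points; the NumPy version above only illustrates the geometry.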

PF-LRM: Pose-Free Large Reconstruction Model for Joint Pose and Shape Prediction