As 2D-to-3D reconstruction gains attention in real-world applications, generating high-quality point clouds has become crucial. Despite the recent success of deep learning models in generating point clouds, producing high-fidelity results remains challenging because of the disparities between images and point clouds. While vision transformers (ViT) and diffusion models have shown promise in various vision tasks, their benefits for reconstructing point clouds from images have not yet been demonstrated. In this paper, we first propose DiffPoint, a neat and powerful architecture that combines ViT and diffusion models for point cloud reconstruction. At each diffusion step, we divide the noisy point cloud into irregular patches. Then, using a standard ViT backbone that treats all inputs as tokens (including time information, image embeddings, and noisy patches), we train our model to predict target points based on input images. We evaluate DiffPoint on both single-view and multi-view reconstruction tasks and achieve state-of-the-art results. Additionally, we introduce a unified and flexible feature fusion module that aggregates image features from single or multiple input images. Furthermore, our work demonstrates the feasibility of applying architectures that are unified across language and images to improve 3D reconstruction tasks.
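
The per-step tokenization described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the irregular patches are formed, so farthest point sampling followed by nearest-center assignment is assumed here, the noising rule is the standard DDPM forward step, and all function names (`add_noise`, `make_patches`, `build_token_sequence`) are hypothetical. In a real model each token would be a learned embedding vector rather than a tagged tuple.

```python
import math
import random

def add_noise(points, alpha_bar, rng):
    """Standard DDPM forward step (assumed): x_t = sqrt(a)*x_0 + sqrt(1-a)*eps."""
    s, n = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [tuple(s * c + n * rng.gauss(0.0, 1.0) for c in p) for p in points]

def farthest_point_centers(points, k):
    """Pick k spread-out center indices by farthest point sampling."""
    centers = [0]
    dist = [math.dist(p, points[0]) for p in points]
    while len(centers) < k:
        idx = max(range(len(points)), key=dist.__getitem__)
        centers.append(idx)
        # Keep, for every point, its distance to the closest chosen center.
        dist = [min(d, math.dist(p, points[idx])) for d, p in zip(dist, points)]
    return centers

def make_patches(points, k):
    """Assign every point to its nearest center; patch sizes are unequal."""
    centers = [points[i] for i in farthest_point_centers(points, k)]
    patches = [[] for _ in range(k)]
    for p in points:
        j = min(range(k), key=lambda c: math.dist(p, centers[c]))
        patches[j].append(p)
    return patches

def build_token_sequence(t, image_embeddings, patches):
    """List the inputs a ViT would consume: time, image, then patch tokens."""
    tokens = [("time", t)]
    tokens += [("image", e) for e in image_embeddings]
    tokens += [("patch", patch) for patch in patches]
    return tokens

rng = random.Random(0)
clean = [tuple(rng.uniform(-1, 1) for _ in range(3)) for _ in range(256)]
noisy = add_noise(clean, alpha_bar=0.5, rng=rng)
patches = make_patches(noisy, k=8)
tokens = build_token_sequence(t=500, image_embeddings=[[0.0] * 4], patches=patches)
```

Because every point goes to its nearest center, the eight patches generally hold different numbers of points, which is one plausible reading of "irregular"; the sequence then contains one time token, one token per image embedding, and one token per patch.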

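This paper proposes DiffPoint, an architecture that combines ViT and diffusion models for 2D-to-3D reconstruction. At each diffusion step, the noisy point cloud is divided into irregular patches and a ViT model is trained to predict the target points. The method achieves state-of-the-art results on single-view and multi-view reconstruction tasks, introduces a unified and flexible feature fusion module for aggregating features from different input images, and further demonstrates the feasibility of applying architectures unified across language and images to improve 3D reconstruction tasks.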

DiffPoint: Single-View and Multi-View Point Cloud Reconstruction with a ViT-Based Diffusion Model