TL;DR: This work proposes MaskPoint, a discriminative Transformer-based masked pretraining framework for point clouds. It represents the point cloud as discrete occupancy values and casts masked prediction as a simple binary classification between masked object points and sampled noise points, making the pretext task robust. The pretrained model performs strongly on a range of downstream tasks, including 3D shape classification, segmentation, and real-world object detection.
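The discriminative pretext task described above can be sketched as follows. This is a minimal illustration of how positive and negative occupancy queries might be constructed, not the authors' implementation; the point counts, masking ratio, and sampling ranges are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "point cloud": points on a unit sphere surface (stand-in for an object).
n_points = 1024
pts = rng.normal(size=(n_points, 3))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)

# Mask out 90% of the points; the encoder only sees the remaining 10%.
n_masked = int(0.9 * n_points)
masked_idx = rng.choice(n_points, size=n_masked, replace=False)
masked_pts = pts[masked_idx]

# Positive queries: points drawn from the masked region (occupancy = 1).
# Negative queries: random noise points in the bounding volume (occupancy = 0).
n_queries = 256
pos = masked_pts[rng.choice(n_masked, size=n_queries, replace=False)]
neg = rng.uniform(-1.0, 1.0, size=(n_queries, 3))

queries = np.concatenate([pos, neg], axis=0)
labels = np.concatenate([np.ones(n_queries), np.zeros(n_queries)])
# A decoder would score each query point; pretraining minimizes binary
# cross-entropy of the predicted occupancy against these labels.
```

Because the target is a discrete occupancy label rather than exact point coordinates, the objective sidesteps the sampling variance that makes direct point regression noisy.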
Abstract
Masked autoencoding has achieved great success for self-supervised learning in the image and language domains. However, mask-based pretraining has yet to show benefits for point cloud understanding, likely because standard backbones like PointNet cannot properly handle the training-versus-testing distribution mismatch introduced by masking during training.