We propose a unified self-supervised learning framework for point cloud videos that covers both object-centric and scene-centric data. Previous methods typically learn representations at the clip or frame level and therefore fail to capture fine-grained semantics. Instead of contrasting clip or frame representations, our framework conducts contrastive learning at the point level. We further introduce a new pretext task, semantic alignment of superpoints, which encourages the representations to capture semantic cues at multiple scales. Because dynamic point clouds are highly redundant along the temporal dimension, conducting contrastive learning directly at the point level usually produces a large number of undesired negatives and models positive representations insufficiently. To remedy this, we propose a selection strategy that retains proper negatives and uses high-similarity samples from other instances as supplementary positives. Extensive experiments show that our method outperforms supervised counterparts on a wide range of downstream tasks and demonstrate the superior transferability of the learned representations.

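We propose a unified self-supervised learning framework for point cloud videos, covering both object-centric and scene-centric data. By conducting contrastive learning at the point level, our method captures fine-grained semantics. We also introduce a new pretext task that aligns the semantics of superpoints to further strengthen the learned representations. In addition, to address the high redundancy along the temporal dimension of dynamic point clouds, we propose a selection strategy that retains proper negatives and uses high-similarity samples from other instances as supplementary positives. Extensive experiments show that our method outperforms supervised counterparts on a variety of downstream tasks and demonstrate the superior transferability of the learned representations.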
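
To make the idea of point-level contrast with negative selection concrete, the following is a minimal NumPy sketch, not the implementation used in this work. It assumes an InfoNCE-style loss in which row `i` of `pred` and row `i` of `target` form a positive pair, and it drops any negative whose cosine similarity to the positive target exceeds a threshold. The function name `point_info_nce`, the temperature `tau`, and the threshold `neg_threshold` are hypothetical choices made for illustration. The use of high-similarity samples as supplementary positives is not covered here.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def point_info_nce(pred, target, tau=0.1, neg_threshold=0.9):
    """Illustrative point-level InfoNCE loss with simple negative filtering.

    pred, target: arrays of shape (N, D); row i of each is a positive pair.
    tau: softmax temperature (assumed value).
    neg_threshold: negatives whose target feature has cosine similarity
        >= neg_threshold with the positive target are discarded (assumed rule).
    """
    pred = l2_normalize(pred)
    target = l2_normalize(target)
    n = pred.shape[0]

    logits = pred @ target.T / tau      # (N, N) scaled similarities
    target_sim = target @ target.T      # similarity among candidate targets

    # Keep the positive (diagonal) plus negatives that are not near-duplicates
    # of the positive target.
    keep = (target_sim < neg_threshold) | np.eye(n, dtype=bool)
    logits = np.where(keep, logits, -np.inf)

    # Numerically stable log-softmax over the retained candidates.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

Setting `neg_threshold` above 1 keeps every negative and recovers the plain loss. Lowering it removes terms from the softmax denominator, so the filtered loss can never exceed the unfiltered one on the same inputs.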

Point Contrastive Prediction with Semantic Clustering for Self-Supervised Learning on Point Cloud Videos