Most existing self-supervised feature learning methods for 3D data learn 3D features from either point cloud data or multi-view images. By exploring the inherent multi-modality of 3D objects, in this paper we propose to jointly learn modal-invariant and view-invariant features from different modalities, including image, point cloud, and mesh, with heterogeneous networks for 3D data. To learn modal- and view-invariant features, we propose two types of constraints: a cross-modal invariance constraint and a cross-view invariance constraint. The cross-modal invariance constraint forces the network to maximize the agreement of features from different modalities of the same object, while the cross-view invariance constraint forces the network to maximize the agreement of features from different image views of the same object. The quality of the learned features is tested on different downstream tasks with three modalities of data: point cloud, multi-view images, and mesh. Furthermore, the invariance across different modalities and views is evaluated with the cross-modal retrieval task. Extensive evaluation results demonstrate that the learned features are robust and generalize well across different tasks.
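
As a rough illustration of how the two constraints could be combined, the sketch below scores the agreement between paired feature sets and sums a cross-modal term and a cross-view term. The abstract does not state the form of the loss, so the InfoNCE-style contrastive formulation, the temperature value, the equal weighting of the terms, and the function names `agreement_loss` and `joint_loss` are all assumptions made for illustration, not the authors' implementation. Plain Python lists stand in for the outputs of the image, point cloud, and mesh networks.

```python
import math


def l2_normalize(v):
    # Scale a feature vector to unit length (assumes a non-zero vector).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def agreement_loss(anchors, positives, temperature=0.1):
    """Assumed InfoNCE-style loss over a batch of objects.

    anchors[i] and positives[i] are features of the same object from two
    sources (two modalities or two views). The loss is low when each
    anchor is more similar to its own positive than to the other rows.
    """
    anchors = [l2_normalize(v) for v in anchors]
    positives = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [dot(a, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def joint_loss(image_view1, image_view2, point_cloud, mesh):
    # Cross-view term: two image views of the same object should agree.
    cross_view = agreement_loss(image_view1, image_view2)
    # Cross-modal term: image, point cloud, and mesh features should agree.
    cross_modal = (
        agreement_loss(image_view1, point_cloud)
        + agreement_loss(image_view1, mesh)
        + agreement_loss(point_cloud, mesh)
    ) / 3
    # Equal weighting is an assumption; the abstract gives no weights.
    return cross_modal + cross_view


if __name__ == "__main__":
    matched = [[1.0, 0.0], [0.0, 1.0]]
    mismatched = [[0.0, 1.0], [1.0, 0.0]]
    print(agreement_loss(matched, matched))     # small: features agree
    print(agreement_loss(matched, mismatched))  # large: features disagree
```

In this toy setting, features that line up object by object give a loss near zero, while features paired with the wrong object are penalized heavily, which is the behavior both invariance constraints ask for.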

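This paper proposes a method for learning modal-invariant and view-invariant features based on heterogeneous networks. Two constraints enforce cross-modal and cross-view consistency of the features, and the method is validated on three data modalities. Experimental results show that the method extracts robust, high-quality features.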

Self-Supervised Modal- and View-Invariant Feature Learning