Advanced self-supervised visual representation learning methods rely on the instance discrimination (ID) pretext task. We point out that the ID task has an implicit semantic consistency (SC) assumption, which may not hold in unconstrained datasets. In this paper, we propose a novel contrastive mask prediction (CMP) task for visual representation learning and design a mask contrast (MaskCo) framework to implement the idea. MaskCo contrasts region-level features instead of view-level features, which makes it possible to identify the positive sample without any such assumption. To bridge the domain gap between masked and unmasked features, we design a dedicated mask prediction head in MaskCo. This module is shown to be the key to the success of CMP. We evaluate MaskCo on training datasets beyond ImageNet and compare its performance with MoCo V2. Results show that MaskCo achieves performance comparable to MoCo V2 when trained on ImageNet, but demonstrates stronger performance across a range of downstream tasks when COCO or Conceptual Captions is used for training. MaskCo provides a promising alternative to ID-based methods for self-supervised learning in the wild.
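The region-level contrast described in the abstract can be illustrated with a minimal sketch. It assumes an InfoNCE-style objective, which the abstract does not confirm, and treats the query as the feature predicted for a masked region and the positive as the feature of the same region taken from an unmasked view. The function names, the temperature value of 0.2, and the toy feature vectors are all hypothetical and chosen only for illustration. This is not the authors' implementation.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def l2_normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def region_info_nce(
    query: Vector,
    positive: Vector,
    negatives: Sequence[Vector],
    temperature: float = 0.2,  # assumed value, not taken from the source
) -> float:
    """InfoNCE-style loss for one masked region (illustrative assumption).

    query:     feature predicted for the masked region
    positive:  feature of the same region from the unmasked view
    negatives: features of other regions
    """
    q = l2_normalize(query)
    # The positive key is placed at index 0, followed by all negatives.
    keys = [l2_normalize(positive)] + [l2_normalize(n) for n in negatives]
    logits = [dot(q, k) / temperature for k in keys]
    # Numerically stable log-sum-exp over all logits.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    # Cross-entropy with the positive as the target class.
    return log_sum_exp - logits[0]


# Toy example: the loss is small when the predicted feature matches
# the true region feature and large when it matches a negative instead.
matched = region_info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
mismatched = region_info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

In this sketch the positive is selected purely by region identity, so no assumption about two views sharing the same semantics is needed to build the pair.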

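This paper proposes a self-supervised visual representation learning method based on contrastive mask prediction (CMP). It contrasts region-level features rather than view-level features, which removes the implicit semantic consistency assumption and allows the positive sample to be identified without assumptions. A dedicated mask prediction head bridges the domain gap between masked and unmasked features. Experiments show that the method achieves comparable performance on natural datasets and outperforms MoCo V2 on a wide range of downstream tasks.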

Self-supervised visual representation learning via contrastive mask prediction