In computer vision, self-supervised contrastive learning enforces similar representations between different views of the same image. Pre-training is most often performed on image classification datasets such as ImageNet, where each image mainly contains a single class of object. In complex scenes with multiple items, however, several views of the same image are very unlikely to represent the same object category. For this setting, we propose SAMCLR, an add-on to SimCLR that uses SAM to segment the image into semantic regions and then samples the two views from the same region. Preliminary empirical results show that, when pre-training on Cityscapes and ADE20K and then evaluating on classification on CIFAR-10, STL10 and ImageNette, SAMCLR performs at least on par with, and most often significantly outperforms, not only SimCLR but also DINO and MoCo.
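The region-constrained view sampling described above can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the SAM output has already been converted into an integer label map (one id per semantic region), and the function name `sample_region_views`, the fixed crop size, and the "centre the crop on a region pixel" rule are all hypothetical choices made for illustration only.

```python
import numpy as np


def sample_region_views(image, label_map, crop_size, rng):
    """Pick one region at random and return two crops anchored inside it.

    image:     (H, W, C) array.
    label_map: (H, W) integer array, one id per region (assumed to be
               derived from SAM masks; that conversion is not shown here).
    crop_size: (crop_h, crop_w), each no larger than the image.
    rng:       a numpy.random.Generator.
    """
    h, w = label_map.shape
    ch, cw = crop_size
    if ch > h or cw > w:
        raise ValueError("crop_size must fit inside the image")

    # Choose the region both views will be drawn from.
    region = rng.choice(np.unique(label_map))
    ys, xs = np.nonzero(label_map == region)

    views, boxes = [], []
    for _ in range(2):
        # Centre each crop on a random pixel of the region, then clamp the
        # box to the image; the chosen pixel always stays inside the crop.
        i = rng.integers(ys.size)
        top = int(np.clip(ys[i] - ch // 2, 0, h - ch))
        left = int(np.clip(xs[i] - cw // 2, 0, w - cw))
        views.append(image[top:top + ch, left:left + cw])
        boxes.append((top, left))
    return region, views, boxes


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((64, 96, 3))
    label_map = np.zeros((64, 96), dtype=int)
    label_map[:, 48:] = 1  # toy two-region "segmentation"
    region, views, boxes = sample_region_views(image, label_map, (32, 32), rng)
    print(region, [v.shape for v in views], boxes)
```

In a SimCLR-style pipeline, the two returned crops would then go through the usual augmentations and contrastive loss; only the choice of where the crops come from changes in this sketch.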

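In computer vision, self-supervised contrastive learning works by enforcing similar representations between different views of the same image. We propose SAMCLR, an add-on to SimCLR that uses SAM to segment the image into semantic regions and then samples the two views from the same region. Preliminary results show that, when pre-training on Cityscapes and ADE20K and then evaluating on classification on CIFAR-10, STL10 and ImageNette, SAMCLR performs at least on par with SimCLR, DINO and MoCo, and often significantly outperforms them.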

SAMCLR: Contrastive pre-training on complex scenes using SAM for view sampling