Many recent approaches in contrastive learning have worked to close the gap between pretraining on iconic images like ImageNet and pretraining on complex scenes like COCO. This gap exists largely because commonly used random crop augmentations obtain semantically inconsistent content in crowded scene images of diverse objects. Previous works use preprocessing pipelines to localize salient objects for improved cropping, but an end-to-end solution is still elusive. In this work, we propose a framework which accomplishes this goal via joint learning of representations and segmentation. We leverage segmentation masks to train a model with a mask-dependent contrastive loss, and use the partially trained model to bootstrap better masks. By iterating between these two components, we ground the contrastive updates in segmentation information, and simultaneously improve segmentation throughout pretraining. Experiments show our representations transfer robustly to downstream tasks in classification, detection and segmentation.
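
As a rough illustration of what a mask-dependent contrastive loss could look like, the NumPy sketch below averages per-pixel features inside each segmentation mask and applies an InfoNCE-style loss between corresponding regions of two augmented views. It is a sketch under stated assumptions, not the authors' implementation: the function names `mask_pool` and `mask_contrastive_loss`, the InfoNCE form, the temperature value, and the toy quadrant masks are all hypothetical. The mask bootstrapping step described in the abstract is not shown.

```python
import numpy as np


def mask_pool(features, masks):
    """Average per-pixel features inside each mask (hypothetical helper).

    features: (H, W, D) array of per-pixel embeddings.
    masks:    (K, H, W) binary array, one channel per region.
    Returns a (K, D) array of L2-normalized region embeddings.
    """
    masks = masks.astype(features.dtype)
    area = masks.sum(axis=(1, 2))[:, None]                  # (K, 1) pixels per mask
    pooled = np.einsum("khw,hwd->kd", masks, features) / np.maximum(area, 1.0)
    norm = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.maximum(norm, 1e-8)


def mask_contrastive_loss(q, k, temperature=0.1):
    """InfoNCE over regions: region i of view 1 is the positive for region i of view 2.

    The InfoNCE form and the temperature are assumptions for illustration.
    """
    logits = q @ k.T / temperature                          # (K, K) similarities
    logits = logits - logits.max(axis=1, keepdims=True)     # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


# Toy data: an 8x8 feature map with 16 channels, split into four quadrant masks.
rng = np.random.default_rng(0)
H, W, D = 8, 8, 16
view1 = rng.standard_normal((H, W, D))
view2 = view1 + 0.01 * rng.standard_normal((H, W, D))      # slightly perturbed second view

masks = np.zeros((4, H, W))
masks[0, :4, :4] = 1
masks[1, :4, 4:] = 1
masks[2, 4:, :4] = 1
masks[3, 4:, 4:] = 1

q = mask_pool(view1, masks)
k = mask_pool(view2, masks)

loss_aligned = mask_contrastive_loss(q, k)                  # masks correspond across views
loss_shuffled = mask_contrastive_loss(q, k[[1, 2, 3, 0]])   # masks deliberately mismatched
print(loss_aligned, loss_shuffled)
```

In this toy setup the loss is lower when the masks correspond across the two views than when they are mismatched, which is the sense in which the contrastive update depends on the segmentation.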

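This work proposes a framework that jointly learns representations and segmentation in order to narrow the accuracy gap between models pretrained on complex scenes (such as COCO) and models pretrained on iconic images (such as ImageNet). Compared with previous methods, the resulting representations are reported to be more robust on downstream tasks such as classification, detection, and segmentation.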

CYBORGS: Contrastively Bootstrapping Object Representations by Grounding in Segmentation