Dec, 2021
Are Large-scale Datasets Necessary for Self-Supervised Pre-training?
Alaaeldin El-Nouby, Gautier Izacard, Hugo Touvron, Ivan Laptev, Hervé Jégou...
TL;DR
This study investigates self-supervised pre-training using only the data of the target task. The results show that, compared with ImageNet pre-training, a denoising-autoencoder approach such as the BEiT variant introduced here is better suited to pre-training data that vary in type and size; when pre-trained on COCO alone, this method surpasses supervised ImageNet pre-training on detection and instance segmentation.
Abstract
Pre-training models on large-scale datasets, like ImageNet, is a standard practice in computer vision. This paradigm is especially effective for tasks with small training sets, for which high-capacity models tend to overfit.
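
To make the denoising-autoencoder pre-training objective concrete, here is a minimal sketch in PyTorch of masked patch prediction: corrupt a random subset of image patches with a learned mask token and train a Transformer encoder to reconstruct them. This is an illustrative assumption of the general recipe, not the paper's exact BEiT variant; all names and hyper-parameters below are hypothetical.

```python
# Minimal sketch of masked-patch denoising pre-training (illustrative only;
# not the authors' exact architecture or loss).
import torch
import torch.nn as nn

class MaskedPatchAutoencoder(nn.Module):
    def __init__(self, patch_dim=768, mask_ratio=0.5):
        super().__init__()
        self.mask_ratio = mask_ratio
        # Learned token that replaces masked patch embeddings.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, patch_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=patch_dim, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        # Linear head predicts the original patch from its encoding.
        self.head = nn.Linear(patch_dim, patch_dim)

    def forward(self, patches):
        # patches: (batch, num_patches, patch_dim), e.g. flattened 16x16 crops.
        b, n, d = patches.shape
        mask = torch.rand(b, n, device=patches.device) < self.mask_ratio
        corrupted = torch.where(
            mask.unsqueeze(-1), self.mask_token.expand(b, n, d), patches)
        recon = self.head(self.encoder(corrupted))
        # Denoising loss: reconstruct only the masked patches.
        return ((recon - patches) ** 2)[mask].mean()
```

Because this objective needs no labels, it can be run on the target-task images themselves (e.g. COCO), which is the setting the paper studies as an alternative to large-scale ImageNet pre-training.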