Mert Bulent Sariyildiz, Julien Perez, Diane Larlus
TL;DR使用图像和标题的联合信息进行预训练可提高图像表征能力,该方法通过 image-conditioned masked language modeling(ICMLM)任务来实现,训练出的表征能够成功应用于多种目标任务。
Abstract
pretraining general-purpose visual features has become a crucial part of tackling many computer vision tasks. While one can learn such features on the extensively-annotated ImageNet dataset, recent approaches hav