Contrastive Language-Image Pre-training (CLIP) on large-scale image-caption datasets learns representations that achieve remarkable zero-shot generalization. However, such models require a massive amount of pre-training data. Improving the quality of the pre-training data has been shown to be much more effective at improving CLIP's performance than increasing its volume. Nevertheless, finding small subsets of training data that provably generalize best has remained an open question. In this work, we propose the first theoretically rigorous data selection method for CLIP. We show that subsets that closely preserve the cross-covariance of the images and captions of the full data provably achieve superior generalization performance. Our extensive experiments on ConceptualCaptions3M and ConceptualCaptions12M demonstrate that subsets found by our method achieve over 2.7x and 1.4x the accuracy of the next best baseline on ImageNet and its shifted versions. Moreover, we show that our subsets obtain 1.5x the average accuracy of the next best baseline across 11 downstream datasets. The code is available at: https://github.com/BigML-CS-UCLA/clipcov-data-efficient-clip.
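
To make the selection criterion concrete, the following is a minimal sketch of how one might measure how closely a candidate subset preserves the image-caption cross-covariance of the full data. It is an illustration under stated assumptions, not the selection algorithm from the paper: the function names `cross_covariance` and `cross_cov_gap`, the mean-centering of embeddings, the relative Frobenius-norm gap, and the synthetic embeddings are all choices made here for the example.

```python
import numpy as np


def cross_covariance(img, txt):
    """Empirical cross-covariance of paired embeddings.

    img: (n, d_img) image embeddings; txt: (n, d_txt) caption embeddings.
    Row i of each array belongs to the same image-caption pair.
    Mean-centering is an assumption of this sketch.
    """
    img_c = img - img.mean(axis=0, keepdims=True)
    txt_c = txt - txt.mean(axis=0, keepdims=True)
    return img_c.T @ txt_c / img.shape[0]


def cross_cov_gap(img, txt, subset_idx):
    """Relative Frobenius distance between subset and full cross-covariance."""
    full = cross_covariance(img, txt)
    sub = cross_covariance(img[subset_idx], txt[subset_idx])
    return np.linalg.norm(full - sub) / np.linalg.norm(full)


# Synthetic stand-in for paired embeddings (hypothetical data, not CC3M/CC12M):
# both modalities are noisy views of a shared latent vector.
rng = np.random.default_rng(0)
n, d = 2000, 16
latent = rng.normal(size=(n, d))
img = latent + 0.1 * rng.normal(size=(n, d))
txt = latent + 0.1 * rng.normal(size=(n, d))

# Gap for a random 10% subset versus the trivial "subset" containing everything.
random_idx = rng.choice(n, size=n // 10, replace=False)
gap_random = cross_cov_gap(img, txt, random_idx)
gap_full = cross_cov_gap(img, txt, np.arange(n))
print(f"random subset gap: {gap_random:.3f}, full data gap: {gap_full:.3f}")
```

A smaller gap means the subset's cross-covariance is closer to that of the full data. A selection procedure in this spirit would search for a subset with a small gap rather than sample one at random.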

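We propose a theoretically rigorous data selection method that improves the generalization performance of Contrastive Language-Image Pre-training models by preserving the cross-covariance of images and captions. Experiments on ConceptualCaptions3M and ConceptualCaptions12M show that our subsets achieve over 2.7x and 1.4x the accuracy of other baselines on ImageNet and its shifted versions, and 1.5x the average accuracy of other baselines across 11 downstream datasets.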

Efficient Contrastive Language-Image Pre-training: Prioritizing Data Quality over Quantity