Contrastively trained image-text models such as CLIP, ALIGN, and BASIC have demonstrated unprecedented robustness to multiple challenging natural distribution shifts. Since these image-text models differ from previous training approaches in several ways, an important question is what causes the large robustness gains. We answer this question via a systematic experimental investigation. Concretely, we study five different possible causes for the robustness gains: (i) the training set size, (ii) the training distribution, (iii) language supervision at training time, (iv) language supervision at test time, and (v) the contrastive loss function. Our experiments show that the more diverse training distribution is the main cause for the robustness gains, with the other factors contributing little to no robustness. Beyond our experimental results, we also introduce ImageNet-Captions, a version of ImageNet with original text annotations from Flickr, to enable further controlled experiments of language-image training.

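Through our experimental study, we find that the main factor behind the improved robustness of contrastively trained language-image models is the diversity of the training distribution, while the other factors contribute almost nothing to robustness. Beyond our experimental results, we also introduce ImageNet-Captions, a version of ImageNet with original text annotations from Flickr, to enable further controlled experiments on language-image training.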

Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP)