CLIP, the first foundation model that connects images and text, has enabled many recent breakthroughs in computer vision. However, its training cost is prohibitively high, which poses a significant barrier to its widespread exploration. In this paper, we present a surprising finding: there exists an inverse scaling law for CLIP training, whereby the larger the image/text encoders used, the shorter the sequence length of image/text tokens that can be applied in training. Moreover, we show that the strategy for reducing image/text token length plays a crucial role in determining the quality of this scaling law. As a result of this finding, we are able to train CLIP successfully using only academic resources. For example, on a server with eight A100 GPUs, our CLIP models achieve zero-shot top-1 ImageNet accuracies of 63.2% in ~2 days, 67.8% in ~3 days, and 69.3% in ~4 days. By lowering the computational barrier associated with CLIP, we hope to inspire more research in this field, particularly from academics. Our code is available at https://github.com/UCSC-VLAA/CLIPA.

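This paper identifies an inverse scaling law for CLIP training: the larger the image/text encoders, the shorter the image/text token sequences that can be used in training. By lowering the computational barrier in this way, we train CLIP successfully on a server with eight A100 GPUs, where our CLIP models reach zero-shot top-1 ImageNet accuracies of 63.2%, 67.8%, and 69.3% in ~2, ~3, and ~4 days, respectively. We hope this encourages more research from academia.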
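
To make "shorter sequence length of image/text tokens" concrete, the following minimal sketch counts the patch tokens a ViT-style image encoder sees at several input resolutions and truncates a text token sequence. It is illustrative only: the patch size of 16, the listed resolutions, and the helper names `num_image_tokens` and `truncate_text_tokens` are assumptions, and image resizing and text truncation are used here simply as stand-ins for a token-reduction strategy, not as the strategy the authors recommend.

```python
from typing import List


def num_image_tokens(image_size: int, patch_size: int = 16) -> int:
    """Patch tokens for a square image in a ViT-style encoder (class token excluded)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    return (image_size // patch_size) ** 2


def truncate_text_tokens(token_ids: List[int], max_len: int) -> List[int]:
    """Keep only the first max_len text tokens (hypothetical reduction strategy)."""
    return token_ids[:max_len]


# Hypothetical resolutions; 224 is used as the full-length reference.
reference = num_image_tokens(224)
for size in (224, 160, 112, 64):
    tokens = num_image_tokens(size)
    print(f"{size}x{size}: {tokens} tokens ({tokens / reference:.0%} of reference)")
```

Because the token count grows with the square of the image side length, halving the resolution from 224 to 112 cuts the image sequence to a quarter of its original length, which is where most of the training-cost savings from shorter sequences would come from.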