Severe data imbalance naturally exists among web-scale vision-language datasets. Despite this, we find that CLIP pre-trained on such data exhibits notable robustness to the imbalance compared to supervised learning, and demonstrates significant effectiveness in learning generalizable representations. To investigate the reasons behind this finding, we conduct controlled experiments on various underlying factors, and reveal that CLIP's pretext task forms a dynamic classification problem wherein only a subset of classes is present in training. This isolates the bias from dominant classes and implicitly balances the learning signal. Furthermore, the robustness and discriminability of CLIP improve with more descriptive language supervision, larger data scale, and broader open-world concepts, which are inaccessible to supervised learning. Our study not only uncovers the mechanisms behind CLIP's generalizability beyond data imbalance but also provides transferable insights for the research community. The findings are validated in both supervised and self-supervised learning, enabling models trained on imbalanced data to achieve CLIP-level performance on diverse recognition tasks. Code will be available at: https://github.com/CVMI-Lab/clip-beyond-tail.
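
One way to picture the "dynamic classification problem" described above is a cross-entropy loss whose softmax runs over only a per-batch subset of classes, so that classes outside the subset do not compete in that step. The sketch below is an illustration under stated assumptions and not the authors' implementation: the function name `subsampled_cross_entropy`, the `vocab_size` parameter, and the uniform sampling of extra negative classes are all hypothetical choices.

```python
import math
import random


def subsampled_cross_entropy(logits, labels, vocab_size, rng):
    """Mean cross-entropy over a per-batch subset of classes.

    logits: list of rows, one score per class (hypothetical classifier output).
    labels: list of ground-truth class indices, one per row.
    vocab_size: assumed cap on how many classes compete in this batch.
    rng: a random.Random instance used to draw extra negative classes.
    """
    num_classes = len(logits[0])
    present = set(labels)  # classes in the batch are always kept
    others = [c for c in range(num_classes) if c not in present]
    # Fill the remaining slots with uniformly sampled negative classes.
    num_extra = max(0, min(vocab_size - len(present), len(others)))
    subset = sorted(present | set(rng.sample(others, num_extra)))

    total = 0.0
    for row, label in zip(logits, labels):
        scores = [row[c] for c in subset]
        peak = max(scores)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - row[label]  # negative log-softmax of the target
    return total / len(labels), subset


# Usage: 10 classes, but only 5 compete in this batch.
rng = random.Random(0)
batch_logits = [[0.0] * 10 for _ in range(4)]
batch_labels = [0, 0, 0, 3]
loss, subset = subsampled_cross_entropy(batch_logits, batch_labels, 5, rng)
```

Because the normalization only covers the sampled subset, a class that does not appear in the batch and is not drawn as a negative has no effect on that step's loss, which is the sense in which the subset changes from step to step.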

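The study finds that, compared with supervised learning, CLIP pre-training shows notable robustness to data imbalance and a strong ability to learn generalizable representations. Controlled experiments on a range of potential factors reveal that CLIP's pretext task forms a dynamic classification problem in which only a subset of classes is present during training, which removes the bias from dominant classes and implicitly balances the learning signal. In addition, CLIP's robustness and discriminability improve with more descriptive language supervision, larger data scale, and broader open-world concepts, none of which are available to supervised learning. The study not only reveals the mechanisms behind CLIP's generalization under data imbalance but also offers useful insights for the research community. The findings are validated in both supervised and self-supervised learning, enabling models trained on imbalanced data to reach CLIP-level performance on diverse recognition tasks.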

Generalization Beyond Data Imbalance: A Controlled Study on CLIP for Transferable Insights