contrastive image-text models such as CLIP form the building blocks of many
state-of-the-art systems. While they excel at recognizing common generic
concepts, they still struggle on fine-grained entities which are rare, or even
absent from the pre-training dataset. Hence, a key ingredi