Contrastive Language-Image Pre-training (CLIP), a straightforward yet effective pre-training paradigm, successfully introduces semantic-rich text supervision to vision models and has demonstrated promising results on various tasks owing to its generalizability and interpretability. It ha
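
To make the paradigm concrete, the following is a minimal sketch of CLIP's symmetric contrastive (InfoNCE) objective over a batch of paired image and text embeddings; the function name, embedding dimension, and temperature value are illustrative assumptions, not details from this paper.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss for paired image/text embeddings (sketch)."""
    # L2-normalize so the dot product below is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity logits between every image and every text.
    logits = image_emb @ text_emb.t() / temperature

    # The i-th image matches the i-th text, so targets are the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

if __name__ == "__main__":
    # Random tensors stand in for hypothetical encoder outputs.
    imgs = torch.randn(8, 512)
    txts = torch.randn(8, 512)
    print(clip_contrastive_loss(imgs, txts).item())
```

In this formulation, each matched image-text pair serves as a positive while all other pairings in the batch act as negatives, which is what lets natural-language captions supervise the vision encoder without task-specific labels.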