Contrastive Language-Image Pre-training (CLIP), a straightforward yet
effective pre-training paradigm, successfully introduces semantically rich text
supervision to vision models and has demonstrated promising results across various
tasks owing to its generalizability and interpretability. It ha