vision-language models such as CLIP have shown impressive capabilities in
encoding texts and images into aligned embeddings, enabling the retrieval of
multimodal data in a shared embedding space. However, these embedding-based
models still face challenges in effectively matching images