Multi-modal embeddings, such as CLIP embeddings (the most widely used text-image embeddings), form the foundation for vision-language models. However, these embeddings are vulnerable to subtle misalignment of cross
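This paper proposes a new method called Coordinated Vision Language Retrieval (CoVLR), which uses meta-optimization to coordinate cross-modal alignment and uni-modal cluster preservation, thereby ensuring both cross-modal consistency and uni-modal structure. Experimental results show that CoVLR improves uni-modal retrieval accuracy while preserving cross-modal retrieval capability.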
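To make the two objectives concrete, the following is a minimal NumPy sketch, not the CoVLR implementation. It assumes a symmetric InfoNCE loss for cross-modal alignment and a pairwise-similarity matching loss as a stand-in for uni-modal cluster preservation, and it combines them with a plain weighted sum rather than the meta-optimization the paper describes. All function names, the `temperature` and `weight` parameters, and the reference embeddings are hypothetical.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def cross_modal_loss(img, txt, temperature=0.07):
    """Symmetric InfoNCE loss; row i of img is assumed to match row i of txt."""
    logits = l2_normalize(img) @ l2_normalize(txt).T / temperature

    def cross_entropy_on_diagonal(scores):
        # Numerically stable log-softmax over each row.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))


def uni_modal_structure_loss(emb, ref):
    """Mean squared gap between the pairwise cosine similarities of emb and ref.

    ref is a hypothetical reference embedding (e.g. from a frozen uni-modal
    encoder) whose neighbourhood structure should be preserved.
    """
    e, r = l2_normalize(emb), l2_normalize(ref)
    return float(np.mean((e @ e.T - r @ r.T) ** 2))


def combined_loss(img, txt, img_ref, txt_ref, weight=1.0):
    """Weighted sum of both objectives (assumption: not the paper's meta-optimization)."""
    structure = (uni_modal_structure_loss(img, img_ref)
                 + uni_modal_structure_loss(txt, txt_ref))
    return float(cross_modal_loss(img, txt)) + weight * structure
```

A fixed `weight` is the simplest way to trade off the two terms; the abstract indicates that CoVLR instead coordinates them through meta-optimization, which this sketch does not attempt to reproduce.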