Aligned text-image encoders such as CLIP have become the de facto model for vision-language tasks. Furthermore, modality-specific encoders achieve impressive performance in their respective domains. This raises a central question: does an alignment exist between uni-modal vision and language encoders, since they fundamentally represent the same physical world? Analyzing the latent space structure of vision and language models on image-caption benchmarks using Centered Kernel Alignment (CKA), we find that the representation spaces of unaligned and aligned encoders are semantically similar. In the absence of statistical similarity in aligned encoders like CLIP, we show that a possible matching of unaligned encoders exists without any training. We frame this as a seeded graph-matching problem that exploits the semantic similarity between graphs, and propose two methods: a Fast Quadratic Assignment Problem optimization and a novel localized CKA metric-based matching/retrieval. We demonstrate the effectiveness of this approach on several downstream tasks, including cross-lingual and cross-domain caption matching and image classification.
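
As a rough illustration of the two ingredients named above, the following sketch computes a linear CKA score between two sets of embeddings and then matches them with SciPy's FAQ solver, fixing a few known image-caption pairs as seeds. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names `linear_cka`, `cosine_kernel`, and `seeded_match` are hypothetical, the linear form of CKA and the cosine similarity graphs are choices made here for simplicity, and the localized CKA method is not shown.

```python
import numpy as np
from scipy.optimize import quadratic_assignment


def linear_cka(x, y):
    """Linear CKA between embeddings x (n, d1) and y (n, d2) of the same n items."""
    # Center each feature dimension before comparing the two spaces.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(x.T @ y, "fro") ** 2
    return cross / (np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro"))


def cosine_kernel(z):
    """Pairwise cosine similarity; used here as the weighted adjacency of a graph."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    return z @ z.T


def seeded_match(img_emb, txt_emb, seeds):
    """Match captions to images by aligning their similarity graphs.

    seeds is an (m, 2) integer array of known pairs (image index, caption index).
    Returns perm, where caption perm[i] is assigned to image i.
    """
    a = cosine_kernel(img_emb)  # image-image similarity graph (assumed choice)
    b = cosine_kernel(txt_emb)  # caption-caption similarity graph (assumed choice)
    result = quadratic_assignment(
        a, b, method="faq",
        options={"maximize": True, "partial_match": seeds},
    )
    return result.col_ind
```

Here `linear_cka` expects the rows of both matrices to be in corresponding order, whereas `seeded_match` needs only the seed pairs to be known. FAQ is a local optimizer, so the matching it returns for the unseeded items is a candidate alignment rather than a guaranteed optimum.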

Do Vision and Language Encoders Represent the World Similarly?