Building cross-modal applications is challenging due to limited paired multi-modal data. Recent works have shown that leveraging a pre-trained multi-modal contrastive representation space enables cross-modal tasks to be learned from uni-modal data. This is based on the assumption that contrastive optimization makes embeddings from different modalities interchangeable. However, this assumption is under-explored due to the poorly understood geometry of the multi-modal contrastive space, where a modality gap exists. In our study, we provide a theoretical explanation of this space's geometry and introduce a three-step method, $C^3$ (Connect, Collapse, Corrupt), to bridge the modality gap, enhancing the interchangeability of embeddings. Our $C^3$ method significantly improves cross-modal learning from uni-modal data, achieving state-of-the-art results on zero-shot image / audio / video captioning and text-to-image generation.
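The abstract names the three steps but does not spell out their mechanics, so the following is only a minimal sketch under stated assumptions: "Connect" is taken to mean embedding both modalities with a shared contrastive encoder (replaced here by toy vectors), "Collapse" is taken to mean subtracting each modality's mean embedding to shrink the modality gap, and "Corrupt" is taken to mean adding Gaussian noise to embeddings during training. The function names, the re-normalization step, the toy data, and the noise scale are all hypothetical choices made for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length, as is typical for contrastive embeddings."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def collapse(embeddings, modality_mean):
    """Assumed 'Collapse' step: remove the per-modality mean, then re-normalize."""
    return l2_normalize(embeddings - modality_mean)

def corrupt(embeddings, noise_std, rng):
    """Assumed 'Corrupt' step: add Gaussian noise, then re-normalize."""
    noisy = embeddings + rng.normal(0.0, noise_std, size=embeddings.shape)
    return l2_normalize(noisy)

rng = np.random.default_rng(0)
dim = 8

# Stand-in for 'Connect': toy embeddings that share content but are offset
# by a constant vector per modality, which mimics a modality gap.
shared = rng.normal(size=(100, dim))
offset = np.zeros(dim)
offset[0] = 5.0  # hypothetical gap direction and size
text_emb = l2_normalize(shared + offset)
image_emb = l2_normalize(shared - offset)

# Distance between modality centroids before and after collapsing.
gap_before = np.linalg.norm(text_emb.mean(axis=0) - image_emb.mean(axis=0))
text_c = collapse(text_emb, text_emb.mean(axis=0))
image_c = collapse(image_emb, image_emb.mean(axis=0))
gap_after = np.linalg.norm(text_c.mean(axis=0) - image_c.mean(axis=0))

# Noisy text embeddings that a decoder could be trained on (uni-modal data only).
train_inputs = corrupt(text_c, noise_std=0.1, rng=rng)

print(f"centroid gap before: {gap_before:.3f}, after: {gap_after:.3f}")
```

In this toy setting the centroid distance shrinks after the mean is removed, which is the sense in which embeddings from the two modalities become more interchangeable. At inference, under the same assumptions, an image embedding would be collapsed with the image mean and passed to the decoder trained on text embeddings.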

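Leveraging a pre-trained multi-modal contrastive representation space allows cross-modal tasks to be learned from uni-modal data. We provide a theoretical explanation of this space's geometry and introduce a three-step method (Connect, Collapse, Corrupt) that narrows the modality gap and makes embeddings more interchangeable. This enables effective cross-modal learning from uni-modal data and achieves state-of-the-art results on zero-shot image/audio/video captioning and text-to-image generation.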

Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data