Large-scale multi-modal contrastive pre-training has demonstrated great
utility in learning transferable features for a range of downstream tasks by
mapping multiple modalities into a shared embedding space. Typically, such
approaches employ separate encoders for each modality. However, recent work suggests
that transformers can support learning across multiple modalities and allow
knowledge sharing. Inspired by this, we investigate a variety of
Modality-Shared Contrastive Language-Image Pre-training (MS-CLIP) frameworks.
More specifically, we ask how many parameters of a transformer model can
be shared across modalities during contrastive pre-training, and rigorously
examine architectural design choices that position the proportion of parameters
shared along a spectrum. Under the conditions studied, we observe that a mostly
unified encoder for vision and language signals outperforms all other
variations that separate more parameters. Additionally, we find that
light-weight modality-specific parallel modules further improve performance.
Experimental results show that the proposed MS-CLIP approach outperforms
vanilla CLIP by up to 13\% relative in zero-shot ImageNet classification
(pre-trained on YFCC-100M), while simultaneously supporting a reduction of
parameters. In addition, our approach outperforms vanilla CLIP by 1.6 points in
linear probing on a collection of 24 downstream vision tasks. Furthermore, we
discover that sharing parameters leads to semantic concepts from different
modalities being encoded more closely in the embedding space, facilitating the
transfer of common semantic structure (e.g., attention patterns) from
language to vision. Code is available at
\href{this https URL}{URL}.

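This study explores multi-modal contrastive pre-training with transformer models. It finds that the approach outperforms the original CLIP method while supporting a reduction in parameter count, and that sharing parameters enables information exchange between modalities and the transfer of similar semantic structure.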
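
To make the idea of a mostly shared encoder trained with a contrastive objective more concrete, the following is a minimal sketch and not the MS-CLIP implementation. It assumes a toy encoder in which only the input projection is modality-specific and a single linear layer stands in for the shared transformer body. The names `ToySharedEncoder`, `contrastive_loss`, and the temperature of 0.07 are illustrative assumptions rather than details taken from the paper.

```python
import math
import random


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def l2_normalize(vector):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vector)) or 1.0
    return [v / norm for v in vector]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class ToySharedEncoder:
    """Hypothetical stand-in: per-modality input projection, shared body."""

    def __init__(self, image_dim, text_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        # Modality-specific parameters (assumed here to be input projections only).
        self.input_proj = {
            "image": random_matrix(hidden_dim, image_dim, rng),
            "text": random_matrix(hidden_dim, text_dim, rng),
        }
        # Parameters shared by both modalities.
        self.shared = random_matrix(hidden_dim, hidden_dim, rng)

    def encode(self, features, modality):
        hidden = [math.tanh(h) for h in matvec(self.input_proj[modality], features)]
        return l2_normalize(matvec(self.shared, hidden))


def _matched_pair_nll(logits):
    """Mean negative log-softmax of the diagonal (matched pair) in each row."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)
        log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-to-text and text-to-image contrastive loss."""
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_embs]
        for img in image_embs
    ]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_matched_pair_nll(logits) + _matched_pair_nll(transposed))


# Usage on random toy features: row i of each batch is treated as a matched pair.
rng = random.Random(1)
encoder = ToySharedEncoder(image_dim=8, text_dim=6, hidden_dim=4)
images = [[rng.random() for _ in range(8)] for _ in range(3)]
texts = [[rng.random() for _ in range(6)] for _ in range(3)]
loss = contrastive_loss(
    [encoder.encode(x, "image") for x in images],
    [encoder.encode(x, "text") for x in texts],
)
print(f"toy contrastive loss: {loss:.4f}")
```

In this sketch the proportion of shared parameters is controlled by how much of the model lives in `shared` versus `input_proj`, which is the kind of spectrum the abstract describes; the actual split studied in the paper may differ.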