Large-scale multi-modal contrastive pre-training has demonstrated great
utility in learning transferable features for a range of downstream tasks by
mapping multiple modalities into a shared embedding space. Typically, this has
employed separate encoders for each modality. However, recent work suggests
that transformers can support learning across multiple modalities and allow
knowledge sharing. Inspired by this, we investigate a variety of
Modality-Shared Contrastive Language-Image Pre-training (MS-CLIP) frameworks.
More specifically, we ask how many parameters of a transformer model can
be shared across modalities during contrastive pre-training, and rigorously
examine architectural design choices that position the proportion of parameters
shared along a spectrum. Under the studied conditions, we observe that a mostly
unified encoder for vision and language signals outperforms all other
variations that separate more parameters. Additionally, we find that
light-weight modality-specific parallel modules further improve performance.
Experimental results show that the proposed MS-CLIP approach outperforms
vanilla CLIP by up to 13% relative in zero-shot ImageNet classification
(pre-trained on YFCC-100M), while simultaneously supporting a reduction of
parameters. In addition, our approach outperforms vanilla CLIP by 1.6 points in
linear probing on a collection of 24 downstream vision tasks. Furthermore, we
discover that sharing parameters leads to semantic concepts from different
modalities being encoded more closely in the embedding space, facilitating the
transfer of common semantic structure (e.g., attention patterns) from
language to vision. Code is available at this https URL.
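As a rough illustration of the contrastive objective that maps paired images and texts into a shared embedding space, the sketch below computes the symmetric InfoNCE loss used by CLIP-style models on a small batch. It is a minimal sketch, not the authors' implementation: the function names and the temperature value of 0.07 are assumptions, and the embeddings are taken as given rather than produced by a modality-shared transformer.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_emb[i] and text_emb[i] are assumed to describe the same sample.
    The temperature default is an illustrative assumption.
    """
    imgs = [l2_normalize(v) for v in image_emb]
    txts = [l2_normalize(v) for v in text_emb]
    n = len(imgs)
    # logits[i][j]: cosine similarity of image i and text j, divided by temperature
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text direction: the matching text for image i is column i.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: same computation on the transposed matrix.
    logits_t = [list(col) for col in zip(*logits)]
    loss_t2i = -sum(log_softmax(logits_t[j])[j] for j in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)
```

Calling `clip_contrastive_loss` on correctly paired embeddings should give a much smaller value than calling it on the same embeddings with the texts shuffled, which is the signal that pulls matching pairs together during pre-training.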

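This study explores multi-modal contrastive pre-training with transformer models. It finds that the approach outperforms the original CLIP method while supporting a reduction in parameter count, and that sharing parameters enables information exchange between modalities and the transfer of similar semantic structure.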

Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training

Video-text retrieval has been a crucial and fundamental task in multi-modal
research. The development of video-text retrieval has been considerably
promoted by large-scale multi-modal contrastive pre-training, which primarily
focuses on coarse-grained or fine-grained contrast. However, cross-grained
contrast, which is the contrast between coarse-grained representations and
fine-grained representations, has rarely been explored in prior research.
Compared with fine-grained or coarse-grained contrast, cross-grained contrast
calculates the correlation between a coarse-grained feature and each fine-grained
feature, and can filter out unnecessary fine-grained features, guided
by the coarse-grained feature, during similarity calculation, thus improving
retrieval accuracy. To this end, this paper presents a novel multi-grained
contrastive model, namely X-CLIP, for video-text retrieval. A further
challenge lies in similarity aggregation, which aims to aggregate
fine-grained and cross-grained similarity matrices into an instance-level
similarity. To address this challenge, we propose the Attention Over Similarity
Matrix (AOSM) module to make the model focus on the contrast between essential
frames and words, thus lowering the impact of unnecessary frames and words on
retrieval results. With multi-grained contrast and the proposed AOSM module,
X-CLIP achieves outstanding performance on five widely used video-text
retrieval datasets: MSR-VTT (49.3 R@1), MSVD (50.4 R@1), LSMDC (26.1
R@1), DiDeMo (47.8 R@1), and ActivityNet (46.2 R@1). It outperforms the previous
state of the art by +6.3%, +6.6%, +11.1%, +6.7%, and +3.8% (relative) on
these benchmarks, demonstrating the superiority of multi-grained contrast and
AOSM.
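The multi-grained similarity described above can be sketched as follows. This is one plausible reading of the abstract, not the authors' implementation: every function name, the temperature `tau`, the two-pass pooling of the frame-word matrix, and the final unweighted average are assumptions made for illustration. The idea shown is that a softmax over similarity scores gives large weights to essential frames and words, so unnecessary ones contribute little to the instance-level score.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def attention_pool(scores, tau=0.01):
    """Softmax-weighted sum of scores; tau is an assumed temperature.

    A small tau lets the highest scores dominate; a large tau approaches the mean.
    """
    weights = softmax([s / tau for s in scores])
    return sum(w * s for w, s in zip(weights, scores))


def pool_matrix(sim, tau=0.01):
    """Reduce a frame-by-word similarity matrix to a single score."""
    # Pool over words within each frame, then over the per-frame scores.
    per_frame = [attention_pool(row, tau) for row in sim]
    words_then_frames = attention_pool(per_frame, tau)
    # Pool over frames within each word, then over the per-word scores.
    per_word = [attention_pool(list(col), tau) for col in zip(*sim)]
    frames_then_words = attention_pool(per_word, tau)
    return 0.5 * (words_then_frames + frames_then_words)


def multi_grained_similarity(video, frames, sentence, words, tau=0.01):
    """Combine coarse, cross-grained, and fine-grained similarities.

    video, sentence: coarse-grained feature vectors.
    frames, words: lists of fine-grained feature vectors.
    """
    video_sentence = dot(video, sentence)                        # coarse vs. coarse
    video_words = [dot(video, w) for w in words]                 # cross-grained
    sentence_frames = [dot(sentence, f) for f in frames]         # cross-grained
    frame_words = [[dot(f, w) for w in words] for f in frames]   # fine vs. fine
    parts = [
        video_sentence,
        attention_pool(video_words, tau),
        attention_pool(sentence_frames, tau),
        pool_matrix(frame_words, tau),
    ]
    # Unweighted average of the four terms (an assumption of this sketch).
    return sum(parts) / len(parts)
```

In a retrieval setting, a score of this kind would be computed for every video-text pair in a batch and fed to a contrastive loss; feature vectors are assumed to be L2-normalized beforehand so that the dot products behave like cosine similarities.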

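This paper proposes a multi-grained contrastive model named X-CLIP, which uses the Attention Over Similarity Matrix (AOSM) module to aggregate multi-grained similarity matrices to the instance level, substantially improving video-text retrieval performance. On five widely used video-text retrieval datasets, X-CLIP improves on the previous state of the art by 3.8% to 11.1% (relative), demonstrating the superiority of multi-grained contrast and the AOSM module.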