As Large Multi-Modal Models (LMMs) continue to grow in size, adapting these
pre-trained models to specialized tasks has become a computationally and
memory-intensive challenge. Traditional fine-tuning methods require isolated,
exhaustive retuning for each new task, limiting the models' versatility.
Moreover, current efficient adaptation techniques often overlook modality
alignment, focusing only on extracting knowledge for new tasks. To
tackle these issues, we introduce Multiway-Adapter, an innovative framework
incorporating an 'Alignment Enhancer' to deepen modality alignment, enabling
high transferability without tuning pre-trained parameters. Our method adds
fewer than 1.25\% additional parameters to LMMs, as demonstrated with the
BEiT-3 model in our study. It yields zero-shot image-text retrieval
performance superior to that of fully fine-tuned models while reducing
fine-tuning time by up to 57\%. Our approach offers a resource-efficient and
effective adaptation pathway for LMMs, broadening their applicability. The
source code is publicly available at:
https://github.com/longkukuhi/MultiWay-Adapter.
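
To give a feel for how an adapter budget such as "fewer than 1.25\%" can be checked, here is a minimal sketch. The abstract does not describe the adapter's internal layout, so the code assumes a generic bottleneck adapter (a down-projection followed by an up-projection, each with a bias); the function names and all sizes are hypothetical and are not taken from BEiT-3 or from the MultiWay-Adapter code.

```python
def bottleneck_adapter_params(hidden_dim: int, bottleneck_dim: int) -> int:
    """Parameters in one down-project / up-project adapter, including biases.

    Assumption: a generic bottleneck adapter, not the paper's exact module.
    """
    down = hidden_dim * bottleneck_dim + bottleneck_dim  # weight + bias
    up = bottleneck_dim * hidden_dim + hidden_dim        # weight + bias
    return down + up


def added_param_fraction(backbone_params: int, hidden_dim: int,
                         bottleneck_dim: int, num_adapters: int) -> float:
    """Adapter parameters as a fraction of the frozen backbone's parameters."""
    added = num_adapters * bottleneck_adapter_params(hidden_dim, bottleneck_dim)
    return added / backbone_params


# Hypothetical sizes chosen only for illustration; they are not BEiT-3's.
fraction = added_param_fraction(
    backbone_params=100_000_000,
    hidden_dim=768,
    bottleneck_dim=64,
    num_adapters=12,
)
print(f"added parameters: {fraction:.2%} of the backbone")
```

Because the backbone stays frozen, only the adapter parameters counted here would receive gradients, which is where the savings in memory and fine-tuning time come from.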

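By introducing the Multiway-Adapter framework and its 'Alignment Enhancer' to deepen modality alignment, we propose an efficient adaptation pathway that gives large multi-modal models high transferability, cuts fine-tuning time by up to 57\%, and delivers strong zero-shot image-text retrieval performance.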

MultiWay-Adapter: Adapting large-scale multi-modal models for scalable image-text retrieval

We present Contrastive Feature Masking Vision Transformer (CFM-ViT), an
image-text pretraining methodology that simultaneously learns image- and
region-level representations for open-vocabulary object detection
(OVD). Our approach incorporates the masked autoencoder (MAE) objective into
the contrastive learning objective to improve the representation for
localization tasks. Unlike standard MAE, we perform reconstruction in the
joint image-text embedding space rather than the pixel space, which helps the
model better learn region-level semantics.
Moreover, we introduce Positional Embedding Dropout (PED) to address scale
variation between image-text pretraining and detection finetuning by randomly
dropping out the positional embeddings during pretraining. PED improves
detection performance and enables the use of a frozen ViT backbone as a region
classifier, preventing the forgetting of open-vocabulary knowledge during
detection finetuning. On the LVIS open-vocabulary detection benchmark, CFM-ViT
achieves a state-of-the-art 33.9 AP$_r$, surpassing the best previous approach
by 7.6 points, and achieves better zero-shot detection transfer. Finally, CFM-ViT
acquires strong image-level representation, outperforming the state of the art
on 8 out of 12 metrics on zero-shot image-text retrieval benchmarks.
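
The idea behind Positional Embedding Dropout can be sketched in a few lines. The abstract says only that positional embeddings are randomly dropped during pretraining, so this sketch assumes the embeddings are omitted for a whole training step with probability `drop_prob` and always added outside training. The function name, the list-of-lists representation, and the `drop_prob` parameter are hypothetical.

```python
import random


def add_positional_embeddings(patch_tokens, pos_embeddings, drop_prob,
                              training, rng=random):
    """Add positional embeddings to patch tokens, or skip them when dropped.

    Assumption: the drop decision is made once per call and applies to all
    tokens; the paper may implement the dropout differently.
    """
    if training and rng.random() < drop_prob:
        # Dropped: the tokens pass through without any positional signal.
        return [list(token) for token in patch_tokens]
    # Kept: element-wise sum of each token with its positional embedding.
    return [
        [t + p for t, p in zip(token, pos)]
        for token, pos in zip(patch_tokens, pos_embeddings)
    ]


tokens = [[1.0, 2.0], [3.0, 4.0]]      # two patch tokens, dimension 2
positions = [[0.5, 0.5], [0.25, 0.25]]  # one embedding per patch position

with_positions = add_positional_embeddings(tokens, positions, 0.5, training=False)
always_dropped = add_positional_embeddings(tokens, positions, 1.0, training=True)
```

Training with the embeddings sometimes absent makes the backbone less reliant on the positional layout seen during pretraining, which is the property the abstract credits for handling the scale change at detection finetuning.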

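CFM-ViT is an image-text pretraining method that simultaneously learns image- and region-level representations for open-vocabulary object detection. By combining the masked autoencoder (MAE) objective with the contrastive learning objective, CFM-ViT performs reconstruction in the joint image-text embedding space and thereby learns region-level semantics better than the classical MAE method. In addition, Positional Embedding Dropout (PED) addresses the scale variation between image-text pretraining and detection finetuning, improving detection performance and allowing a frozen ViT backbone to serve as a region classifier, which prevents open-vocabulary knowledge from being forgotten during detection finetuning. On the LVIS open-vocabulary detection benchmark, CFM-ViT achieves a state-of-the-art 33.9 AP$_r$, surpassing the best approach by 7.6 points, and attains better zero-shot detection transfer. Finally, CFM-ViT acquires strong image-level representation, outperforming the state of the art on 8 out of 12 metrics on zero-shot image-text retrieval benchmarks.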