Recently, fine-tuning language models pre-trained on large text corpora has provided huge improvements on vision-and-language (V&L) tasks as well as on pure language tasks. However, fine-tuning the entire parameter set of pre-trained models is becoming impractical as model sizes grow rapidly. Hence, in this paper, we introduce adapter-based parameter-efficient transfer learning techniques to V&L models such as VL-BART and VL-T5. We evaluate our methods in a unified multi-task setup on four diverse V&L tasks: VQAv2, GQA, NLVR2, and MSCOCO image captioning. With careful training and thorough experiments, we benchmark three popular adapter-based methods (Adapter, Hyperformer, Compacter) against standard full fine-tuning and the recently proposed prompt-tuning approach. We also enhance the efficiency and performance of adapters by sharing their weights across tasks so that knowledge is shared as well. Our results demonstrate that training the adapter with the weight-sharing technique (4.4% of total parameters) can match the performance of fine-tuning the entire model. Lastly, we present a comprehensive analysis including the combination of adapters and task-specific prompts and the impact of V&L pre-training on adapters. Our code is available at: https://github.com/ylsung/VL_adapter.
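
To make the parameter-efficiency argument concrete, the following minimal sketch counts the trainable parameters of bottleneck adapters when one adapter set is shared across tasks versus when each task gets its own set. It is an illustration under stated assumptions, not the authors' implementation: the function names, the inclusion of bias terms, and every dimension used below (hidden size, bottleneck size, layer count, backbone size) are hypothetical and are not taken from the paper.

```python
def adapter_param_count(d_model: int, bottleneck: int) -> int:
    """Parameters in one bottleneck adapter (assumed: two linear maps with biases)."""
    down = d_model * bottleneck + bottleneck   # down-projection weights + bias
    up = bottleneck * d_model + d_model        # up-projection weights + bias
    return down + up


def trainable_fraction(backbone_params: int, d_model: int, bottleneck: int,
                       adapters_per_layer: int, num_layers: int,
                       num_tasks: int, shared: bool) -> float:
    """Share of all parameters that is trained when the backbone stays frozen."""
    per_set = adapter_param_count(d_model, bottleneck) * adapters_per_layer * num_layers
    # With weight sharing, all tasks reuse one adapter set;
    # otherwise every task adds its own copy.
    adapter_total = per_set if shared else per_set * num_tasks
    return adapter_total / (backbone_params + adapter_total)


if __name__ == "__main__":
    # Hypothetical configuration, chosen only for illustration.
    config = dict(backbone_params=220_000_000, d_model=768, bottleneck=96,
                  adapters_per_layer=2, num_layers=12, num_tasks=4)
    print("shared adapters:  ", trainable_fraction(shared=True, **config))
    print("per-task adapters:", trainable_fraction(shared=False, **config))
```

Under these assumptions, sharing keeps the adapter budget independent of the number of tasks, whereas per-task adapters grow linearly with it. The percentages printed by this sketch depend entirely on the hypothetical configuration and should not be compared with the 4.4% figure reported above.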

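This paper proposes adapter-based parameter-efficient transfer learning techniques, using VL-BART and VL-T5 as example models, and evaluates them in a unified multi-task setup on image-text and video-text benchmarks. Weight sharing improves the efficiency and performance of the adapters: with adapters accounting for 4.18% and 3.39% of total parameters on image-text and video-text tasks, respectively, the approach matches the performance of fine-tuning the entire model. The paper also provides a comprehensive analysis of combining adapters with task-specific prompts and of the impact of V&L pre-training on adapters.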

VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks