This paper introduces a novel Parameter-Efficient Fine-Tuning (PEFT) framework for multi-modal, multi-task transfer learning with pre-trained language models. PEFT techniques such as LoRA, BitFit and IA3 have demonstrated performance comparable to full fine-tuning of pre-trained models on specific downstream tasks, while requiring significantly fewer trainable parameters and less GPU memory. In multi-modal fine-tuning, however, architectural modifications or full fine-tuning are often still needed. To address this, we propose Context-PEFT, which learns different groups of adapter parameters based on each token's domain. This approach enables LoRA-like weight injection without requiring additional architectural changes. Our method is evaluated on the COCO captioning task, where it outperforms full fine-tuning under similar data constraints while being substantially more parameter-efficient and computationally economical.
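The idea of learning separate adapter groups per token domain can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the class name `ContextLoRALinear`, the integer `context_ids` (for example 0 for text tokens and 1 for image tokens), and the rank and scaling values are all assumptions made for illustration. The sketch keeps one frozen weight matrix and adds a LoRA-style low-rank update whose factors are selected per token.

```python
import numpy as np


class ContextLoRALinear:
    """Frozen linear layer plus one low-rank (A, B) pair per token context.

    Hypothetical illustration only; names and shapes are assumptions.
    """

    def __init__(self, d_in, d_out, num_contexts, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        # Pre-trained weight: stays frozen during adaptation.
        self.W = rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)
        # One adapter group per context (e.g. 0 = text, 1 = image).
        self.A = rng.standard_normal((num_contexts, d_in, rank)) * 0.01
        # Zero-initialised B, as in LoRA, so the update starts as a no-op.
        self.B = np.zeros((num_contexts, rank, d_out))
        self.scale = alpha / rank

    def __call__(self, x, context_ids):
        # x: (seq_len, d_in); context_ids: (seq_len,) integer domain labels.
        base = x @ self.W
        A = self.A[context_ids]  # (seq_len, d_in, rank)
        B = self.B[context_ids]  # (seq_len, rank, d_out)
        low = np.einsum("si,sir->sr", x, A)      # project down per token
        delta = np.einsum("sr,sro->so", low, B)  # project back up per token
        return base + self.scale * delta


layer = ContextLoRALinear(d_in=8, d_out=6, num_contexts=2, rank=2)
tokens = np.random.default_rng(1).standard_normal((5, 8))
domains = np.array([0, 0, 1, 1, 1])  # two text tokens, three image tokens
outputs = layer(tokens, domains)     # shape (5, 6)
```

Because the adapter factors are indexed by `context_ids`, only the `A` and `B` arrays would be trained in this sketch, and the base layer's shape and weights are left untouched, which mirrors the abstract's claim that no additional architectural changes are required.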

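Summary: This work proposes a new Parameter-Efficient Fine-Tuning (PEFT) framework for multi-modal, multi-task transfer learning. PEFT techniques such as LoRA, BitFit and IA3 match the performance of full fine-tuning on specific downstream tasks with far fewer trainable parameters and less GPU memory, yet multi-modal fine-tuning often still requires architectural modifications or full fine-tuning. Context-PEFT addresses this by learning a separate group of adapter parameters for each token domain, which enables LoRA-like weight injection without further architectural changes. Evaluated on the COCO captioning task, it outperforms full fine-tuning under similar data constraints while being more parameter-efficient and computationally cheaper.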

Context-PEFT: Efficient Multi-Modal, Multi-Task Fine-Tuning