Current solutions for efficiently constructing large vision-language (VL) models follow a two-step paradigm: first, the output of a pre-trained vision encoder is projected into the input space of a pre-trained language model as visual prompts; then the model is transferred to downstream VL tasks via end-to-end parameter-efficient fine-tuning (PEFT). However, this paradigm remains inefficient because it significantly increases the input length of the language model. In this paper, instead of integrating visual prompts into the input, we regard visual prompts as additional knowledge that helps language models address tasks associated with visual information. Motivated by the finding that the feed-forward network (FFN) of a language model acts as a "key-value memory", we introduce a novel approach termed memory-space visual prompting (MemVP), in which visual prompts are concatenated with the weights of the FFN to inject visual knowledge. Experimental results across various VL tasks and language models show that MemVP significantly reduces the training time and inference latency of the fine-tuned VL models and surpasses the performance of previous PEFT methods. Code: https://github.com/JieShibo/MemVP
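
The key-value-memory view of the FFN can be illustrated with a minimal sketch. It is not taken from the MemVP repository: the function names, the ReLU activation, the omission of biases, and the use of separate `vis_keys` and `vis_values` lists are all assumptions made for illustration. Under those assumptions, appending visual entries to the FFN's key and value matrices leaves the token sequence untouched and simply adds a visual lookup term to the FFN output.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def matvec(rows, x):
    """Dot each row of `rows` with the vector `x`."""
    return [sum(r_i * x_i for r_i, x_i in zip(row, x)) for row in rows]


def ffn(x, keys, values):
    """Key-value view of an FFN: sum_i relu(x . k_i) * v_i (biases omitted)."""
    scores = relu(matvec(keys, x))
    dim = len(values[0])
    return [sum(s * v[j] for s, v in zip(scores, values)) for j in range(dim)]


def ffn_with_visual_memory(x, keys, values, vis_keys, vis_values):
    """Hypothetical helper: concatenate visual entries with the FFN weights.

    The input `x` is a single token representation; no visual tokens are
    added to the input sequence, only extra rows to the FFN "memory".
    """
    return ffn(x, keys + vis_keys, values + vis_values)


# Toy example: 3-dimensional hidden state, 3 FFN entries, 2 visual entries.
x = [0.5, -1.0, 2.0]
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
vis_keys = [[0.0, 0.0, 1.0], [-1.0, 0.0, 0.0]]
vis_values = [[10.0, 20.0], [30.0, 40.0]]

text_only = ffn(x, keys, values)
visual_only = ffn(x, vis_keys, vis_values)
combined = ffn_with_visual_memory(x, keys, values, vis_keys, vis_values)
print(text_only, visual_only, combined)
```

Because the activation in this sketch is element-wise, the combined output equals the original FFN output plus the contribution retrieved from the visual entries, which is one way to read "visual knowledge injection" without lengthening the input.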

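Current solutions for efficiently building large vision-language models adopt a two-step paradigm: the output of a pre-trained vision encoder is projected into the input space of a pre-trained language model as visual prompts, and the model is then transferred to downstream vision-language tasks through end-to-end parameter-efficient fine-tuning (PEFT). This paradigm remains inefficient, however, because it significantly increases the input length of the language model. This paper proposes a novel method called memory-space visual prompting (MemVP). Rather than integrating visual prompts into the input, we treat them as additional knowledge that helps the language model handle tasks involving visual information. By concatenating the visual prompts with the weights of the language model's feed-forward network (FFN), MemVP substantially reduces the training time and inference latency of fine-tuned vision-language models, and experiments across a range of vision-language tasks and language models show that it outperforms previous PEFT methods.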

Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning