To bridge the gap between the vision and language modalities, Multimodal Large Language Models (MLLMs) usually learn an adapter that converts visual inputs into tokens that Large Language Models (LLMs) can understand. However, most adapters generate the same visual tokens for a given image, regardless of which objects the prompt asks about. Because these adapters attend equally to every detail of the image and encode the entire scene, they may increase the cognitive load on LLMs, particularly for complex scenes. To alleviate this problem, we propose prompt-aware adapters, which dynamically embed visual inputs according to the specific focus of the prompt. Specifically, prompt-aware adapters use both global and local textual features to capture the visual clues most relevant to the prompt at coarse and fine granularity levels. This approach significantly enhances the ability of LLMs to understand and interpret visual content. Experiments on various visual question answering tasks, such as counting and position reasoning, demonstrate the effectiveness of prompt-aware adapters.
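
The idea of embedding visual inputs differently depending on the prompt can be illustrated with a minimal sketch. It is not the architecture proposed here: the function names (`global_reweight`, `local_attend`, `prompt_aware_adapter`), the use of a mean-pooled prompt embedding as the global textual feature, and the plain dot-product attention are all assumptions chosen for illustration. The coarse step reweights image patches by their similarity to the global prompt feature; the fine step lets each prompt token attend over the reweighted patches.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def global_reweight(patches, prompt_tokens):
    """Coarse step (assumed): scale each patch by its similarity to the
    mean-pooled prompt embedding, used here as the global textual feature."""
    dim = len(prompt_tokens[0])
    count = len(prompt_tokens)
    global_text = [sum(tok[i] for tok in prompt_tokens) / count for i in range(dim)]
    weights = softmax([dot(p, global_text) / math.sqrt(dim) for p in patches])
    # Multiply by the patch count so that uniform weights leave patches unchanged.
    n = len(patches)
    reweighted = [[w * n * v for v in p] for p, w in zip(patches, weights)]
    return reweighted, weights


def local_attend(patches, prompt_tokens):
    """Fine step (assumed): each prompt token attends over all patches and
    yields one prompt-conditioned visual token."""
    dim = len(prompt_tokens[0])
    tokens = []
    for tok in prompt_tokens:
        attn = softmax([dot(tok, p) / math.sqrt(dim) for p in patches])
        tokens.append([sum(a * p[i] for a, p in zip(attn, patches)) for i in range(dim)])
    return tokens


def prompt_aware_adapter(patches, prompt_tokens):
    """Hypothetical adapter: coarse reweighting followed by fine attention."""
    reweighted, weights = global_reweight(patches, prompt_tokens)
    return local_attend(reweighted, prompt_tokens), weights


# Two toy patches and two toy prompts pointing at different patches.
patches = [[1.0, 0.0], [0.0, 1.0]]
tokens_a, weights_a = prompt_aware_adapter(patches, [[4.0, 0.0]])
tokens_b, weights_b = prompt_aware_adapter(patches, [[0.0, 4.0]])
```

With the same image patches, the two prompts yield different visual tokens, whereas a prompt-agnostic adapter would return identical tokens in both cases.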

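To bridge the gap between the vision and language modalities, we propose prompt-aware adapters, which dynamically embed visual inputs according to the specific focus of the prompt. By capturing the visual clues most relevant to the prompt, they significantly enhance the ability of large language models to understand and interpret visual content. Experiments demonstrate the effectiveness of prompt-aware adapters on various visual question answering tasks, such as counting and position reasoning.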

Prompt-Aware Adapter: Learning Adaptive Visual Features for Multimodal Large Language Models