Large Vision-Language Models (LVLMs) integrate computer vision and natural language processing techniques and offer substantial application potential. However, these models demand extensive resources during inference. Adaptive attention techniques can dynamically reduce computational redundancy and thus improve efficiency. Although current adaptive attention methods significantly reduce the memory requirements of Transformer-based language models, they are not tailored to LVLMs. We observe that LVLMs generate responses from both remote image tokens and local text tokens, and that the two modalities exhibit different attention patterns. This observation motivates us to manage attention for each modality separately. Specifically, for visual input, we keep the cache of potentially useful information but compute attention only over the most critical parts. For language input, we focus on local information. Based on our observation and analysis of vision-language attention patterns, we develop A-VL, a plug-and-play adaptive attention method tailored for LVLM inference. Extensive evaluations on three vision-language tasks and five datasets demonstrate the effectiveness of our design. A-VL outperforms existing adaptive attention methods in reducing memory usage and computational load without compromising performance.
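To make the modality-specific idea concrete, the following is a minimal sketch of one way such a selection step could look. It is not the A-VL implementation: the function name `select_tokens`, its parameters (`cache_k`, `compute_k`, `text_window`), and the use of a single per-token attention score are all assumptions introduced for illustration. Image tokens are ranked by score, a larger set is retained in the cache, and only a smaller top-ranked subset takes part in computation; text tokens are kept by locality alone.

```python
from typing import List, Tuple


def select_tokens(
    attn_scores: List[float],
    is_image: List[bool],
    cache_k: int,
    compute_k: int,
    text_window: int,
) -> Tuple[List[int], List[int]]:
    """Illustrative modality-aware token selection (hypothetical helper).

    attn_scores: one attention score per past token (assumed to be given,
                 e.g. averaged over heads; how it is obtained is out of scope).
    is_image:    True where the token at that position is an image token.
    Returns (cached, computed): sorted index lists of tokens to keep in the
    cache and tokens to actually attend over.
    """
    if len(attn_scores) != len(is_image):
        raise ValueError("attn_scores and is_image must have the same length")
    if compute_k > cache_k:
        raise ValueError("compute_k must not exceed cache_k")

    image_idx = [i for i, img in enumerate(is_image) if img]
    text_idx = [i for i, img in enumerate(is_image) if not img]

    # Visual input: rank image tokens by score, cache a larger set,
    # and compute over only the highest-scoring subset of it.
    ranked = sorted(image_idx, key=lambda i: attn_scores[i], reverse=True)
    cached_image = ranked[:cache_k]
    computed_image = ranked[:compute_k]

    # Language input: keep only the most recent (local) text tokens.
    local_text = text_idx[-text_window:] if text_window > 0 else []

    cached = sorted(cached_image + local_text)
    computed = sorted(computed_image + local_text)
    return cached, computed


if __name__ == "__main__":
    scores = [0.05, 0.30, 0.10, 0.20, 0.02, 0.03, 0.10, 0.20]
    modality = [True] * 4 + [False] * 4  # four image tokens, then four text tokens
    cached, computed = select_tokens(
        scores, modality, cache_k=3, compute_k=2, text_window=2
    )
    print(cached)    # [1, 2, 3, 6, 7]
    print(computed)  # [1, 3, 6, 7]
```

Keeping the cached set larger than the computed set reflects the abstract's distinction between storing potentially useful visual information and computing only the most critical parts; the specific sizes used here are arbitrary.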

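This work addresses the resource consumption that Large Vision-Language Models (LVLMs) face during inference. It proposes A-VL, an adaptive attention technique that manages the attention patterns of visual and language inputs separately, significantly reducing memory requirements and computational load. Experimental results show that A-VL outperforms existing adaptive attention methods on multiple vision-language tasks, demonstrating its potential impact on both efficiency and performance.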

Adaptive Attention for Large Vision-Language Models