Vision-Language Models (VLMs) have demonstrated impressive performance across a wide range of tasks. A key challenge in accelerating VLMs is storing and accessing the large Key-Value (KV) cache that encodes long visual contexts, such as images or videos. While existing KV cache compression methods are effective for Large Language Models (LLMs), directly migrating them to VLMs yields suboptimal accuracy and speedup. To bridge the gap, we propose VL-Cache, a novel KV cache compression recipe tailored for accelerating VLM inference. In this paper, we first investigate the unique sparsity pattern of VLM attention by distinguishing visual and text tokens in the prefill and decoding phases. Based on these observations, we introduce a layer-adaptive, sparsity-aware cache budget allocation method that effectively distributes the limited cache budget across layers, further reducing KV cache size without compromising accuracy. Additionally, we develop a modality-aware token scoring policy to better evaluate token importance. Empirical results on multiple benchmark datasets demonstrate that retaining only 10% of the KV cache achieves accuracy comparable to that of the full cache. In a speed benchmark, our method speeds up end-to-end generation of 100 tokens by up to 2.33x and speeds up decoding by up to 7.08x, while reducing the GPU memory footprint of the KV cache by 90%.
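
To make the two ideas above concrete, the following is a minimal sketch of a layer-wise budget split followed by top-k token selection. It is an illustration under stated assumptions, not the method as implemented in the paper: the abstract does not give the allocation rule or the scoring policy, so the sketch assumes budgets proportional to a per-layer attention density and takes per-token importance scores as given. The names `allocate_layer_budgets`, `select_tokens`, `layer_densities`, and `keep_ratio` are hypothetical.

```python
def allocate_layer_budgets(layer_densities, total_tokens, keep_ratio):
    """Split a global KV cache budget across layers.

    Assumption: each layer receives a share proportional to its attention
    density (1 - sparsity). The actual allocation rule may differ.
    """
    num_layers = len(layer_densities)
    # Global budget: keep_ratio of all cached tokens, summed over layers.
    total_budget = keep_ratio * total_tokens * num_layers
    density_sum = sum(layer_densities)
    budgets = []
    for density in layer_densities:
        # Fall back to a uniform split if every layer reports zero density.
        share = density / density_sum if density_sum > 0 else 1.0 / num_layers
        budget = int(round(total_budget * share))
        # Keep at least one token and never more than the layer holds.
        budgets.append(max(1, min(total_tokens, budget)))
    return budgets


def select_tokens(scores, budget):
    """Return the positions of the `budget` highest-scoring tokens, in order.

    Assumption: `scores` holds one importance value per cached token; how the
    scores are computed (e.g. per modality) is outside this sketch.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


if __name__ == "__main__":
    # Three layers with hypothetical densities, 100 cached tokens, 10% kept.
    budgets = allocate_layer_budgets([1, 3, 6], total_tokens=100, keep_ratio=0.1)
    print(budgets)
    print(select_tokens([0.2, 0.9, 0.1, 0.5], budget=2))
```

In this sketch, denser layers keep more tokens while the overall number of retained entries stays near 10% of the full cache, which corresponds to the 90% memory reduction quoted above.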

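This work addresses the inefficient storage and access of the KV cache when accelerating vision-language model (VLM) inference. It proposes VL-Cache, a novel KV cache compression method that builds on the sparsity characteristics of VLM attention and a modality-aware policy to substantially speed up inference while preserving accuracy. Experiments show that retaining only 10% of the KV cache achieves accuracy comparable to that of the full cache, while markedly reducing inference latency and memory footprint.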

VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration