In vision-language models (VLMs), visual tokens usually account for a large share of the computational cost, even though they carry less information per token than text tokens. To address this, most existing methods train a network to prune redundant visual tokens, which requires additional training data. In contrast, we propose SparseVLM, an efficient training-free token optimization mechanism that adds no extra parameters and no fine-tuning cost. Concretely, since visual tokens complement text tokens in VLMs for linguistic reasoning, we select vision-relevant text tokens and use them to rate the significance of visual tokens within the self-attention matrix extracted from the VLM. We then progressively prune irrelevant tokens. To maximize sparsity while retaining essential information, we introduce a rank-based strategy that adaptively determines the sparsification ratio for each layer, alongside a token recycling method that compresses pruned tokens into more compact representations. Experimental results show that SparseVLM improves the efficiency of various VLMs across a range of image and video understanding tasks. In particular, LLaVA equipped with SparseVLM reduces FLOPs by 61% to 67% at a compression ratio of 78% while maintaining 93% of the accuracy. Our code is available at https://github.com/Gumpest/SparseVLMs.
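
The scoring, pruning, and recycling steps described above can be illustrated with a minimal NumPy sketch. This is not the released implementation: the function names `visual_token_scores` and `prune_and_recycle`, the "at least average attention mass" rule for choosing vision-relevant text tokens, the fixed `keep` count (standing in for the rank-based per-layer ratio), and the merge of all pruned tokens into a single score-weighted token are assumptions made for illustration only.

```python
import numpy as np


def visual_token_scores(attn, text_idx, visual_idx):
    """Score visual tokens by the attention they receive from text tokens.

    attn: (seq_len, seq_len) head-averaged self-attention, rows are queries.
    """
    text_idx = np.asarray(text_idx)
    visual_idx = np.asarray(visual_idx)
    # Text-to-visual block: rows are text queries, columns are visual keys.
    t2v = attn[np.ix_(text_idx, visual_idx)]
    # Assumption: a text token counts as vision-relevant if its total
    # attention mass on visual tokens is at least the average over text tokens.
    relevance = t2v.sum(axis=1)
    raters = relevance >= relevance.mean()
    # Average the attention of the selected raters per visual token.
    return t2v[raters].mean(axis=0)


def prune_and_recycle(hidden, visual_idx, scores, keep):
    """Keep the `keep` best visual tokens and merge the rest into one token.

    hidden: (seq_len, dim) hidden states.
    Returns (kept positions, kept visual states plus one recycled state).
    """
    visual_idx = np.asarray(visual_idx)
    order = np.argsort(-scores, kind="stable")  # highest score first
    kept = np.sort(visual_idx[order[:keep]])    # preserve original order
    dropped = visual_idx[order[keep:]]
    kept_tokens = hidden[kept]
    if dropped.size == 0:
        return kept, kept_tokens
    # Assumption: recycle pruned tokens as a single score-weighted average.
    w = scores[order[keep:]]
    total = w.sum()
    w = w / total if total > 0 else np.full(w.shape, 1.0 / w.size)
    recycled = (w[:, None] * hidden[dropped]).sum(axis=0, keepdims=True)
    return kept, np.concatenate([kept_tokens, recycled], axis=0)
```

In this sketch, `attn` would be a head-averaged attention map taken from one decoder layer, and `keep` would be chosen per layer. The abstract derives that per-layer ratio from a rank-based strategy, which is not reproduced here.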

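This work addresses the heavy computational cost of visual tokens in vision-language models and proposes SparseVLM, an efficient token optimization mechanism that requires no additional training data. The method uses the relevant text tokens in the self-attention matrix to rate the significance of visual tokens and progressively prunes irrelevant ones, substantially improving the efficiency of several vision-language models on image and video understanding tasks while maintaining high accuracy.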

SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference