Large language models (LLMs) have shown remarkable potential in processing long sequences, yet efficiently serving these long-context models remains challenging due to the quadratic computational complexity of attention in the prefilling stage and the large memory footprint of the KV cache in the decoding stage. To address these issues, we introduce LServe, an efficient system that accelerates long-sequence LLM serving via hybrid sparse attention. This method unifies different hardware-friendly, structured sparsity patterns for both prefilling and decoding attention into a single framework, where computations on less important tokens are skipped block-wise. LServe demonstrates the compatibility of static and dynamic sparsity in long-context LLM attention. This design enables multiplicative speedups by combining these optimizations. Specifically, we convert half of the attention heads to nearly free streaming heads in both the prefilling and decoding stages. Additionally, we find that only a constant number of KV pages is required to preserve long-context capabilities, irrespective of context length. We then design a hierarchical KV page selection policy that dynamically prunes KV pages based on query-centric similarity. On average, LServe accelerates LLM prefilling by up to 2.9x and decoding by 1.3-2.1x over vLLM, maintaining long-context accuracy. Code is released at https://github.com/mit-han-lab/omniserve.
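The query-centric page pruning described above can be illustrated with a minimal sketch. It is a flat, single-level simplification rather than the hierarchical policy in LServe, and it assumes that each KV page is summarized by per-channel minimum and maximum key values; the abstract does not specify the summary statistic, so that choice, along with the function names `summarize_pages`, `page_score`, and `select_pages`, is hypothetical. The point it shows is that the number of pages kept (`budget`) is a constant that does not grow with context length.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
PageSummary = Tuple[List[float], List[float]]  # (per-channel min, per-channel max)


def summarize_pages(keys: Sequence[Vector], page_size: int) -> List[PageSummary]:
    """Split the key cache into fixed-size pages and summarize each page.

    Assumption: a page is represented by the per-channel min and max of its keys.
    """
    pages: List[PageSummary] = []
    for start in range(0, len(keys), page_size):
        chunk = keys[start:start + page_size]
        kmin = [min(col) for col in zip(*chunk)]
        kmax = [max(col) for col in zip(*chunk)]
        pages.append((kmin, kmax))
    return pages


def page_score(query: Vector, kmin: Vector, kmax: Vector) -> float:
    """Upper bound on the dot product between the query and any key in the page."""
    return sum(max(q * lo, q * hi) for q, lo, hi in zip(query, kmin, kmax))


def select_pages(query: Vector, pages: Sequence[PageSummary], budget: int) -> List[int]:
    """Return the indices of the `budget` highest-scoring pages, in cache order."""
    scores = [page_score(query, lo, hi) for lo, hi in pages]
    ranked = sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Toy key cache: 6 tokens, 2 channels, 2 tokens per page (3 pages in total).
keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, 0.0], [-0.8, -0.1]]
pages = summarize_pages(keys, page_size=2)
selected = select_pages([1.0, 0.0], pages, budget=2)
```

Attention for the current query would then be computed only over the tokens in the selected pages, so the per-step decoding cost depends on `budget` rather than on the full sequence length.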

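This work addresses the computational complexity of the prefilling stage and the memory footprint of the decoding stage when serving long-sequence large language models (LLMs). The proposed LServe system accelerates LLM serving through hybrid sparse attention, unifying different sparsity patterns into a single framework for attention in both prefilling and decoding. The results show that the system speeds up LLM prefilling by up to 2.9x and decoding by 1.3-2.1x while maintaining long-context accuracy.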

LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention