Efficient use of GPU memory is essential for high-throughput LLM inference.
Prior systems reserved memory for the KV-cache ahead-of-time, resulting in
wasted capacity due to internal fragmentation. Inspired by OS-based virtual
memory systems, vLLM proposed PagedAttention to enable dynamic memory
allocation for KV-cache. This approach eliminates fragmentation, enabling
high-throughput LLM serving with larger batch sizes. However, to allocate
physical memory dynamically, PagedAttention changes the layout of the KV-cache
from contiguous to non-contiguous virtual memory. This change requires
attention kernels to be rewritten to support paging and the serving framework
to implement a memory manager. Thus, the PagedAttention model leads to
software complexity, portability issues, redundancy, and inefficiency.
In this paper, we propose vAttention for dynamic KV-cache memory management.
In contrast to PagedAttention, vAttention retains the KV-cache in contiguous
virtual memory and leverages existing low-level system support for demand
paging to enable on-demand physical memory allocation. Thus,
vAttention unburdens the attention kernel developer from having to explicitly
support paging and avoids re-implementation of memory management in the serving
framework. We show that vAttention enables seamless dynamic memory management
for unchanged implementations of various attention kernels. vAttention also
generates tokens up to 1.97x faster than vLLM, while processing input prompts
up to 3.92x and 1.45x faster than the PagedAttention variants of FlashAttention
and FlashInfer.
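
The idea of keeping the KV-cache contiguous in virtual memory while backing it
with physical memory only on demand can be illustrated with a toy model. The
sketch below is pure Python with no GPU involved; the class name
`ContiguousKVBuffer`, the constant `PAGE_TOKENS`, and the page-counting logic
are hypothetical stand-ins chosen for illustration, not vAttention's actual
implementation.

```python
from dataclasses import dataclass

PAGE_TOKENS = 4  # hypothetical: number of tokens whose KV entries fit in one page


@dataclass
class ContiguousKVBuffer:
    """Toy model of a contiguous virtual range backed by pages mapped on demand."""
    max_tokens: int        # size of the virtual reservation, in tokens
    mapped_pages: int = 0  # physical pages currently backing the range
    num_tokens: int = 0    # tokens written so far

    def append_token(self) -> int:
        """Record one token's KV entry and return its virtual index."""
        if self.num_tokens >= self.max_tokens:
            raise MemoryError("virtual reservation exhausted")
        # Pages needed to hold num_tokens + 1 tokens (ceiling division).
        needed = -(-(self.num_tokens + 1) // PAGE_TOKENS)
        if needed > self.mapped_pages:
            # Stands in for mapping one more physical page into the range.
            self.mapped_pages = needed
        index = self.num_tokens  # the virtual index is simply the token position
        self.num_tokens += 1
        return index


buf = ContiguousKVBuffer(max_tokens=16)
indices = [buf.append_token() for _ in range(5)]
print(indices, buf.mapped_pages)
```

Because token positions map directly to virtual offsets in this model, code
that reads the buffer needs no translation table; only the number of backing
pages changes as the sequence grows.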

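To address the GPU memory challenges of high-throughput LLM inference, the paper proposes vAttention, a dynamic KV-cache memory management approach. In contrast to the PagedAttention model, vAttention retains the KV-cache in contiguous virtual memory and uses existing low-level system support to allocate physical memory on demand.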

vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention

High-throughput serving of large language models (LLMs) requires batching
sufficiently many requests at a time. However, existing systems struggle
because the key-value cache (KV cache) memory for each request is huge and
grows and shrinks dynamically. When managed inefficiently, this memory can be
significantly wasted by fragmentation and redundant duplication, limiting the
batch size. To address this problem, we propose PagedAttention, an attention
algorithm inspired by the classical virtual memory and paging techniques in
operating systems. On top of it, we build vLLM, an LLM serving system that
achieves (1) near-zero waste in KV cache memory and (2) flexible sharing of KV
cache within and across requests to further reduce memory usage. Our
evaluations show that vLLM improves the throughput of popular LLMs by
2-4$\times$ with the same level of latency compared to the state-of-the-art
systems, such as FasterTransformer and Orca. The improvement is more pronounced
with longer sequences, larger models, and more complex decoding algorithms.
vLLM's source code is publicly available.
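
The paging idea described above can be sketched as a per-request block table
that translates a logical token position into a physical block and an offset.
This is a minimal illustration in pure Python, not vLLM's implementation; the
names `BlockTable` and `BLOCK_SIZE`, and the block size of 16 tokens, are
assumptions made for the example.

```python
BLOCK_SIZE = 16  # hypothetical number of tokens per KV block


class BlockTable:
    """Toy per-request block table: logical block i -> physical block id."""

    def __init__(self, free_blocks):
        self.free_blocks = free_blocks  # shared pool of free physical block ids
        self.blocks = []                # physical block ids in logical order
        self.num_tokens = 0

    def append_token(self):
        """Return (physical_block, offset) for the next token's KV entry."""
        if self.num_tokens % BLOCK_SIZE == 0:  # no block yet, or last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            self.blocks.append(self.free_blocks.pop())
        slot = (self.blocks[-1], self.num_tokens % BLOCK_SIZE)
        self.num_tokens += 1
        return slot

    def lookup(self, token_index):
        """Translate a logical token index to (physical_block, offset)."""
        if not 0 <= token_index < self.num_tokens:
            raise IndexError(token_index)
        return self.blocks[token_index // BLOCK_SIZE], token_index % BLOCK_SIZE

    def wasted_slots(self):
        """Unused slots in allocated blocks; always less than BLOCK_SIZE."""
        return len(self.blocks) * BLOCK_SIZE - self.num_tokens


table = BlockTable(free_blocks=[7, 3, 9])
for _ in range(20):
    table.append_token()
print(table.blocks, table.lookup(17), table.wasted_slots())
```

In this model a request never holds more than one partially filled block, so
its unused memory stays below `BLOCK_SIZE` slots regardless of how long the
sequence grows, whereas reserving the maximum sequence length up front would
leave every unwritten position idle.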

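High-throughput serving of large language models relies on batching many requests. This work proposes the PagedAttention algorithm and the vLLM system to reduce waste and redundant duplication of key-value cache (KV cache) memory, improving throughput and memory utilization.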