Efficient use of GPU memory is essential for high-throughput LLM inference.
Prior systems reserved memory for the KV-cache ahead of time, resulting in
wasted capacity due to internal fragmentation. Inspired by OS-based virtual
memory systems, vLLM proposed PagedAttention to enable dynamic memory
allocation for KV-cache. This approach eliminates fragmentation, enabling
high-throughput LLM serving with larger batch sizes. However, to allocate
physical memory dynamically, PagedAttention changes the layout of
KV-cache from contiguous virtual memory to non-contiguous virtual memory. This
change requires attention kernels to be rewritten to support paging, and the
serving framework to implement a memory manager. Thus, the PagedAttention model
leads to software complexity, portability issues, redundancy and inefficiency.
In this paper, we propose vAttention for dynamic KV-cache memory management.
In contrast to PagedAttention, vAttention retains KV-cache in contiguous
virtual memory and leverages existing low-level system support for demand
paging to enable on-demand physical memory allocation. Thus,
vAttention unburdens the attention kernel developer from having to explicitly
support paging and avoids re-implementation of memory management in the serving
framework. We show that vAttention enables seamless dynamic memory management
for unchanged implementations of various attention kernels. vAttention also
generates tokens up to 1.97x faster than vLLM, while processing input prompts
up to 3.92x and 1.45x faster than the PagedAttention variants of FlashAttention
and FlashInfer.
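
The core idea above, reserving a contiguous virtual range up front and attaching physical memory only as the KV-cache grows, can be illustrated with a toy model. The following is a minimal sketch and not the vAttention implementation: the class `ToyVirtualKVCache`, the helper `static_pages`, and the constant `PAGE_TOKENS` are hypothetical names introduced for illustration, and physical pages are represented only by a counter. The abstract does not name the underlying mechanism; on NVIDIA GPUs, the CUDA driver's virtual memory management calls (for example `cuMemAddressReserve` and `cuMemMap`) are one plausible candidate, which is an assumption here.

```python
from dataclasses import dataclass, field

PAGE_TOKENS = 4  # hypothetical: number of tokens that fit in one physical page


def static_pages(max_tokens: int) -> int:
    """Pages a system must commit if it reserves the full context ahead of time."""
    return -(-max_tokens // PAGE_TOKENS)  # ceiling division


@dataclass
class ToyVirtualKVCache:
    """Toy model: a contiguous virtual range with physical pages mapped on demand."""

    max_tokens: int        # size of the virtual reservation (no physical cost)
    mapped_pages: int = 0  # physical pages currently backing the range
    tokens: list = field(default_factory=list)

    def append(self, kv) -> None:
        if len(self.tokens) >= self.max_tokens:
            raise MemoryError("virtual reservation exhausted")
        # Back the range with one more page only when the new token needs it.
        needed = len(self.tokens) // PAGE_TOKENS + 1
        if needed > self.mapped_pages:
            self.mapped_pages = needed
        self.tokens.append(kv)

    def read(self, i: int):
        # The layout stays contiguous, so a reader uses plain indexing
        # and needs no block table.
        return self.tokens[i]


cache = ToyVirtualKVCache(max_tokens=16)
for t in range(5):
    cache.append(t)
print(cache.mapped_pages, "pages mapped;", static_pages(16), "if reserved statically")
```

In this toy, a request that has produced 5 tokens is backed by 2 pages, whereas reserving the full 16-token context ahead of time would commit 4. The `read` method is the point of the contrast with PagedAttention: because the virtual layout is unchanged, code that indexes the cache does not need to know that paging is happening.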

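To address the problem of using GPU memory efficiently for high-throughput LLM inference, the paper proposes vAttention, a dynamic KV-cache memory management method. In contrast to the PagedAttention model, vAttention retains the KV-cache in contiguous virtual memory and leverages existing low-level system support to enable on-demand physical memory allocation.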