BriefGPT.xyz
Mar, 2024
GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu...
TL;DR
This work proposes GEAR, an efficient KV cache compression framework that achieves near-lossless compression at high compression ratios. Compared with other methods, GEAR reduces peak memory size while delivering up to a 2.38x throughput improvement.
Abstract
Key-value (KV) caching has become the de-facto technique to accelerate generation speed for large language model (LLM) inference. However, the growing cache demand with increasing sequence length has transformed LLM inference…
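The TL;DR and abstract describe compressing the KV cache to cut peak memory. As a hedged illustration only, and not GEAR's actual recipe, the sketch below shows the simplest form of such compression: per-channel uniform quantization of a KV tensor to a few bits, with dequantization on read. All function names, shapes, and parameters here are assumptions for the example.

```python
import numpy as np

def quantize_kv(x, n_bits=4):
    """Per-channel min-max uniform quantization of a KV cache tensor.

    Illustrative baseline only: stores n_bits integers plus a
    per-channel scale and zero point (lo).
    """
    qmax = 2 ** n_bits - 1
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = (hi - lo) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard constant channels
    q = np.clip(np.round((x - lo) / scale), 0, qmax).astype(np.uint8)
    return q, scale, lo

def dequantize_kv(q, scale, lo):
    # Reconstruct an approximation of the original tensor.
    return q.astype(np.float32) * scale + lo

# Hypothetical KV cache slice: (heads, tokens, head_dim).
rng = np.random.default_rng(0)
kv = rng.normal(size=(2, 8, 16)).astype(np.float32)

q, scale, lo = quantize_kv(kv)
rec = dequantize_kv(q, scale, lo)
err = np.abs(kv - rec).max()  # bounded by half a quantization step
```

Plain uniform quantization like this loses accuracy at aggressive bit widths; the paper's framing of "near-lossless high-ratio compression" is precisely about closing that gap.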