Optimizing the Key-Value (KV) cache of Large Language Models (LLMs) is
considered critical to reducing the cost of inference. Most existing
KV-cache compression algorithms sparsify the sequence of tokens by
exploiting differences in token importance. In this work, we find
that by also identifying the importance of attention layers, we can optimize the
KV-cache jointly along two dimensions. Based on our observations regarding
layer-wise importance in inference, we propose SqueezeAttention, which precisely
optimizes the allocation of the KV-cache budget among layers on the fly and then
incorporates three representative token sparsification algorithms to compress
the KV-cache of each layer within its own budget. By optimizing the
KV-cache along both the sequence and layer dimensions, SqueezeAttention achieves
memory reductions of around 30% to 70% and throughput improvements of up to 2.2
times across a wide range of LLMs and benchmarks. The code is available at
this https URL

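By identifying the importance of attention layers, we propose SqueezeAttention, which precisely optimizes the on-the-fly allocation of the KV-cache budget and combines it with three representative token sparsification algorithms to compress the KV-cache of each layer. By optimizing along both the sequence and layer dimensions, SqueezeAttention achieves memory reductions of 30% to 70% and throughput improvements of up to 2.2 times across a variety of LLMs and benchmarks.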
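
The two-dimensional idea can be illustrated with a minimal sketch. The abstract does not say how layer importance is measured or how the budget is divided, so everything here is an assumption made for illustration: the importance scores are taken as given non-negative numbers, the budget is split in proportion to them, and a plain top-k over per-token scores stands in for a token sparsification algorithm. The names `allocate_layer_budgets` and `keep_top_tokens` are hypothetical and are not taken from the SqueezeAttention code.

```python
from typing import List


def allocate_layer_budgets(
    importance: List[float], total_budget: int, min_per_layer: int = 1
) -> List[int]:
    """Split total_budget KV-cache slots across layers.

    Assumption: a proportional split over non-negative importance scores.
    The abstract does not specify the actual allocation rule.
    """
    n = len(importance)
    if total_budget < n * min_per_layer:
        raise ValueError("total_budget is too small for the per-layer minimum")
    spare = total_budget - n * min_per_layer
    total_importance = sum(importance)
    if total_importance <= 0:
        shares = [spare / n] * n  # no signal: fall back to an even split
    else:
        shares = [spare * w / total_importance for w in importance]
    budgets = [min_per_layer + int(s) for s in shares]
    # Hand out the slots lost to rounding, largest fractional remainder first.
    leftover = total_budget - sum(budgets)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets


def keep_top_tokens(token_scores: List[float], budget: int) -> List[int]:
    """Return positions of the `budget` highest-scoring tokens, in sequence order.

    Assumption: a simple top-k stand-in for a token sparsification algorithm.
    """
    ranked = sorted(range(len(token_scores)), key=lambda i: token_scores[i], reverse=True)
    return sorted(ranked[:budget])


# Layer dimension first, then sequence dimension within each layer's budget.
layer_budgets = allocate_layer_budgets([1.0, 1.0, 2.0], total_budget=12)
kept_positions = keep_top_tokens([0.1, 0.9, 0.3, 0.7], budget=2)
```

In this sketch the layer with the highest score receives the largest share of the cache, and each layer then evicts tokens independently until it fits its own share. Any real sparsification policy could replace `keep_top_tokens` without changing the allocation step.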