Large Language Models (LLMs) have shown remarkable comprehension abilities
but face challenges in GPU memory usage during inference, hindering their
scalability for real-time applications like chatbots. To accelerate inference,
we store computed keys and values (the KV cache) in GPU memory. Existing
methods compress the KV cache to reduce memory by pruning the
pre-computed KV cache. However, they neglect the dependency between
layers and the huge memory consumption of pre-computation. Investigating these
deficiencies, we find that the number of crucial keys and values that influence
future generation decreases layer by layer, and that they can be extracted via
the consistency of attention weights. Based on these findings, we propose
PyramidInfer, a method that compresses the KV cache by retaining crucial
context layer by layer. PyramidInfer saves significant memory by computing
fewer keys and values without sacrificing performance. Experimental results
show that PyramidInfer improves throughput by 2.2x compared to Accelerate, with
over 54% GPU memory reduction in the KV cache.
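
The layer-wise retention idea can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `pyramid_select`, the use of the mean attention weight over recent queries as a stand-in for "consistency", the hand-picked keep ratios, and the toy attention values are all assumptions made for illustration.

```python
def pyramid_select(per_layer_attn, keep_ratios):
    """Pick the context positions whose keys/values each layer retains.

    per_layer_attn[l] holds attention-weight rows for layer l, one row per
    recent query token, each row covering all context positions.
    keep_ratios[l] is the fraction of the context that layer l may keep.
    """
    n_ctx = len(per_layer_attn[0][0])
    candidates = list(range(n_ctx))  # the first layer sees the full context
    kept_per_layer = []
    for attn_rows, ratio in zip(per_layer_attn, keep_ratios):
        # Assumption: the mean weight over recent queries stands in for
        # "consistency" of attention toward a context position.
        score = {j: sum(row[j] for row in attn_rows) / len(attn_rows)
                 for j in candidates}
        budget = min(len(candidates), max(1, int(ratio * n_ctx)))
        top = sorted(candidates, key=score.get, reverse=True)[:budget]
        candidates = sorted(top)  # deeper layers only choose among survivors
        kept_per_layer.append(candidates)
    return kept_per_layer


# Toy attention weights: 2 layers, 2 recent queries, 4 context positions.
toy_attn = [
    [[0.10, 0.40, 0.20, 0.30], [0.05, 0.50, 0.15, 0.30]],
    [[0.60, 0.10, 0.10, 0.20], [0.50, 0.20, 0.10, 0.20]],
]
kept = pyramid_select(toy_attn, keep_ratios=[0.75, 0.5])
print(kept)  # [[1, 2, 3], [1, 3]]
```

Because each layer chooses only among the positions kept by the layer before it, the retained sets shrink with depth, which gives the pyramid shape the method is named after.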

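PyramidInfer compresses the KV cache while retaining crucial context, improving the scalability of large language models in terms of GPU memory usage and inference speed. Experimental results show that, compared with Accelerate, PyramidInfer increases throughput by 2.2x while reducing KV cache GPU memory by over 54%.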

Pyramid Inference: Pyramid KV Cache Compression for High-Throughput LLM Inference

PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference

Channel Pruning, widely used for accelerating Convolutional Neural Networks,
is an NP-hard problem due to the inter-layer dependency of channel redundancy.
Existing methods generally ignore this dependency for computational
simplicity. To solve the problem, we propose a layer-wise Recursive Bayesian
Pruning method (RBP) under the Bayesian framework. We introduce a new
dropout-based measurement of redundancy, which facilitates computing the
posterior under inter-layer dependency. Specifically, we model the
noise across layers as a Markov chain and target its posterior to reflect the
inter-layer dependency. Because a closed-form solution for the posterior is
intractable, we derive a sparsity-inducing Dirac-like prior that regularizes
the distribution of the designed noise to automatically approximate the
posterior. Compared with existing methods, no additional overhead is
required when the inter-layer dependency is assumed. The redundant channels can
be identified simply by their tiny dropout noise and pruned directly, layer by
layer.
Experiments on popular CNN architectures show that the proposed method
outperforms several state-of-the-art methods. In particular, we achieve up to
$\mathbf{5.0\times}$ and $\mathbf{2.2\times}$ FLOPs reduction with little
accuracy loss on the large-scale dataset ILSVRC2012 for VGG16 and ResNet50,
respectively.
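
The final pruning step and its effect on FLOPs can be sketched as follows. This is only an illustration under stated assumptions: the per-channel noise values are made up, the threshold of 0.01 and the helper names `prune_by_noise`, `conv_flops`, and `flops_reduction` are hypothetical, and the Bayesian procedure that would actually learn the noise is not shown. The FLOPs count uses the standard multiply-accumulate estimate for a convolution layer.

```python
def prune_by_noise(noise_per_layer, threshold=1e-2):
    """Return, for each layer, the indices of channels whose noise is not tiny."""
    return [[c for c, n in enumerate(layer) if abs(n) >= threshold]
            for layer in noise_per_layer]


def conv_flops(c_in, c_out, kernel, out_h, out_w):
    """Multiply-accumulate count of a kernel x kernel convolution layer."""
    return c_in * c_out * kernel * kernel * out_h * out_w


def flops_reduction(in_channels, noise_per_layer, kernel, out_hw, threshold=1e-2):
    """Ratio of dense FLOPs to FLOPs after pruning a chain of conv layers."""
    kept = prune_by_noise(noise_per_layer, threshold)
    dense = pruned = 0
    c_in_dense = c_in_pruned = in_channels
    for layer_noise, keep, (h, w) in zip(noise_per_layer, kept, out_hw):
        dense += conv_flops(c_in_dense, len(layer_noise), kernel, h, w)
        pruned += conv_flops(c_in_pruned, len(keep), kernel, h, w)
        # Channels pruned here also shrink the input of the next layer.
        c_in_dense, c_in_pruned = len(layer_noise), len(keep)
    if pruned == 0:
        raise ValueError("every channel was pruned; lower the threshold")
    return dense / pruned


# Made-up noise values for two 4-channel conv layers on an RGB input.
toy_noise = [[0.9, 0.001, 0.5, 0.0], [0.7, 0.0, 0.002, 0.3]]
print(prune_by_noise(toy_noise))  # [[0, 2], [0, 3]]
print(flops_reduction(3, toy_noise, kernel=3, out_hw=[(8, 8), (8, 8)]))  # 2.8
```

In this toy case, halving the channels of both layers cuts the second layer's cost by a factor of four, since both its input and output widths shrink. This is one concrete way that pruning decisions in adjacent layers interact.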

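This work proposes a Recursive Bayesian Pruning method (RBP) to accelerate convolutional neural networks. It uses a dropout-based redundancy measure that accounts for inter-layer dependency, addressing the problem that conventional methods ignore this dependency. Experiments show that the method outperforms existing methods on several CNN architectures; in particular, on the large-scale ILSVRC2012 dataset it achieves up to 5x and 2.2x FLOPs reduction for VGG16 and ResNet50, respectively, with little accuracy loss.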