Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process. This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers. We critically analyze existing quantization approaches and identify their limitations in balancing the accuracy and efficiency of quantized LLMs. To advance beyond these limitations, we propose WKVQuant, a post-training quantization (PTQ) framework designed specifically for quantizing the weights and the key/value (KV) cache of LLMs. In particular, we incorporate past-only quantization to improve the computation of attention. Additionally, we introduce a two-dimensional quantization strategy to handle the distribution of the KV cache, along with a cross-block reconstruction regularization for parameter optimization. Experiments show that WKVQuant achieves memory savings nearly comparable to those of weight-activation quantization, while also approaching the performance of weight-only quantization.
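
The idea of past-only quantization can be illustrated with a small sketch. It rests on one reading of the abstract, namely that the key and value of the token currently being decoded enter attention in full precision, while only previously cached entries are stored as low-bit integers. The names `quantize`, `dequantize`, and `PastOnlyKVCache` are hypothetical, and the per-vector min/max quantizer is a simple stand-in, not the paper's two-dimensional quantization strategy.

```python
import math

def quantize(values, bits=4):
    """Uniform asymmetric quantization of one vector (illustrative stand-in)."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo

def dequantize(codes, scale, lo):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale + lo for c in codes]

class PastOnlyKVCache:
    """Hypothetical cache: past K/V are stored quantized, current K/V are not."""

    def __init__(self, bits=4):
        self.bits = bits
        self.keys = []    # each entry: (codes, scale, lo)
        self.values = []

    def attend(self, q, k_new, v_new):
        # Past entries are dequantized; the current token stays in full precision.
        keys = [dequantize(*k) for k in self.keys] + [k_new]
        values = [dequantize(*v) for v in self.values] + [v_new]

        # Scaled dot-product attention for a single query vector.
        d = len(q)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(v_new))]

        # Only after it has been used is the current token quantized and cached.
        self.keys.append(quantize(k_new, self.bits))
        self.values.append(quantize(v_new, self.bits))
        return out

cache = PastOnlyKVCache(bits=4)
first = cache.attend([1.0, 0.0], [0.5, -0.5], [2.0, 4.0])
second = cache.attend([0.0, 1.0], [0.1, 0.9], [1.0, 3.0])
```

In this sketch the quantization error of a token affects only later decoding steps, never the step at which the token itself is generated.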

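This paper proposes a quantization method for large language models: the WKVQuant framework quantizes the weights and the key/value (KV) cache, uses past-only quantization to improve attention computation, and introduces a two-dimensional quantization strategy to handle the distribution of the KV cache, combined with cross-block reconstruction regularization for parameter optimization. Experiments show that WKVQuant achieves memory savings nearly comparable to those of weight-activation quantization while approaching the performance of weight-only quantization.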

WKVQuant: Quantizing Weights and the Key/Value Cache to Improve the Performance of Large Language Models