In this work we show that the size versus accuracy trade-off of neural network quantization can be significantly improved by increasing the quantization dimensionality. We propose GPTVQ, a fast new method for post-training vector quantization (VQ) that scales well to Large Language Models (LLMs). Our method interleaves quantization of one or more columns with updates to the remaining unquantized weights, using information from the Hessian of the per-layer output reconstruction MSE. Quantization codebooks are initialized using an efficient data-aware version of the EM algorithm. The codebooks are then updated and further compressed using integer quantization and SVD-based compression. GPTVQ establishes a new state of the art in the size versus accuracy trade-off on a wide range of LLMs such as Llama-v2 and Mistral. Furthermore, our method is efficient: on a single H100 it takes between 3 and 11 hours to process a Llama-v2-70B model, depending on the quantization setting. Lastly, using on-device timings for VQ decompression on a mobile CPU, we show that VQ leads to improved latency compared to a 4-bit integer format.
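To make the notion of quantization dimensionality concrete, the following is a minimal sketch of plain vector quantization of a weight matrix. It is not the GPTVQ algorithm: it uses ordinary Lloyd's k-means in place of the data-aware EM initialization, and it omits the Hessian-based weight updates and the codebook compression steps. The function names `nearest_centroid`, `kmeans_codebook`, and `vq_quantize`, the random weight matrix, and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np


def nearest_centroid(points, codebook):
    """Return, for each point, the index of its closest codebook entry."""
    # Squared Euclidean distance from every point to every centroid.
    d2 = ((points[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


def kmeans_codebook(points, k, iters=25, seed=0):
    """Fit a k-entry codebook with plain Lloyd's k-means (an assumption;
    the paper describes a data-aware EM variant instead)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct data points.
    codebook = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = nearest_centroid(points, codebook)
        for j in range(k):
            members = points[assign == j]
            if len(members) > 0:  # leave empty clusters unchanged
                codebook[j] = members.mean(axis=0)
    return codebook, nearest_centroid(points, codebook)


def vq_quantize(W, dim, bits_per_weight):
    """Quantize W by grouping `dim` consecutive weights into one vector.

    Each vector is stored as a single index of dim * bits_per_weight bits,
    so the index cost per weight stays at bits_per_weight (the codebook
    itself adds a small overhead that is ignored here).
    """
    k = 2 ** (dim * bits_per_weight)
    points = W.reshape(-1, dim)  # requires W.size to be divisible by dim
    codebook, idx = kmeans_codebook(points, k)
    W_hat = codebook[idx].reshape(W.shape)  # decompression is a table lookup
    return W_hat, codebook, idx


# Illustrative usage on a random matrix standing in for a layer's weights.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
W_hat, codebook, idx = vq_quantize(W, dim=2, bits_per_weight=2)
mse = float(np.mean((W - W_hat) ** 2))
print(f"codebook shape: {codebook.shape}, reconstruction MSE: {mse:.4f}")
```

Raising `dim` while holding `bits_per_weight` fixed grows the codebook exponentially (here 2 dimensions at 2 bits per weight gives 16 entries), which is one reason codebook size and compression matter in practice.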

GPTVQ: The Blessing of Dimensionality for LLM Quantization