Parameter quantization for Large Language Models (LLMs) has attracted
increasing attention recently as a way to reduce memory costs and improve
computational efficiency. Early approaches have been widely adopted. However,
existing methods perform poorly in low-bit (e.g., 2- to 3-bit)
scenarios. In this paper, we present a novel and effective Column-Level
Adaptive weight Quantization (CLAQ) framework by introducing three different
types of adaptive strategies for LLM quantization. Firstly, a K-Means
clustering based algorithm is proposed that allows dynamic generation of
quantization centroids for each column of a parameter matrix. Secondly, we
design an outlier-guided adaptive precision search strategy which can
dynamically assign varying bit-widths to different columns. Finally, a dynamic
outlier reservation scheme is developed to retain some parameters in their
original floating-point precision in exchange for improved model performance.
Experiments on various mainstream open-source LLMs, including LLaMA-1, LLaMA-2,
and Yi, demonstrate that our methods achieve state-of-the-art results across
different bit settings, especially in extremely low-bit scenarios. Code will be
released soon.
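
As an illustration of the first strategy only, the following is a minimal sketch of column-wise K-Means quantization, assuming a plain Lloyd's iteration with quantile initialization. It is not the released CLAQ code: the function names (`kmeans_1d`, `quantize_columns`, `dequantize`), the initialization, and the iteration count are all hypothetical choices made for this example, and the adaptive precision search and outlier reservation are not covered.

```python
import numpy as np

def kmeans_1d(values, k, iters=20):
    """Plain Lloyd's k-means on a 1-D array; returns (centroids, labels)."""
    # Assumption: initialise centroids at evenly spaced quantiles of the data.
    centroids = np.quantile(values, np.linspace(0.0, 1.0, k))
    for _ in range(iters):
        # Assign each value to its nearest centroid.
        labels = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = values[labels == j]
            if members.size:  # an empty cluster keeps its previous centroid
                centroids[j] = members.mean()
    # Final assignment against the updated centroids.
    labels = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
    return centroids, labels

def quantize_columns(weight, bits):
    """Give every column of `weight` its own codebook of 2**bits centroids."""
    k = 2 ** bits
    codebooks = np.empty((k, weight.shape[1]), dtype=weight.dtype)
    codes = np.empty(weight.shape, dtype=np.uint8)  # enough for bits <= 8
    for c in range(weight.shape[1]):
        codebooks[:, c], codes[:, c] = kmeans_1d(weight[:, c], k)
    return codebooks, codes

def dequantize(codebooks, codes):
    """Replace each code by the matching centroid of its own column."""
    return np.take_along_axis(codebooks, codes.astype(np.intp), axis=0)

# Usage: quantize a small random matrix to 2 bits per weight.
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 4))
codebooks, codes = quantize_columns(W, bits=2)
W_hat = dequantize(codebooks, codes)
```

Because each column carries its own codebook, the centroids follow that column's weight distribution rather than a fixed uniform grid. The cost is storing `2**bits` centroids per column alongside the integer codes.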

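This paper introduces a parameter quantization method based on the Column-Level Adaptive weight Quantization (CLAQ) framework. By introducing three different adaptive strategies, it reduces memory usage and improves computational efficiency in large language models. Experimental results show that the method achieves state-of-the-art results across different bit settings, especially in extremely low-bit scenarios.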