Jul, 2024
GPTQT: Quantize Large Language Models Twice to Push the Efficiency
Yipin Guo, Yilin Lang, Qinyuan Ren
TL;DR
This work introduces GPTQT, a new post-training quantization method that represents LLM weights in 3 bits / 2 bits to reduce memory usage and increase processing speed. In testing, GPTQT further reduced perplexity by 4.01 on opt-66B compared to a strong 3-bit quantization baseline and achieved a 1.24× speedup on opt-30b, making it the current best binary-coding quantization method for such LLMs.
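The "quantize twice" idea can be illustrated with a minimal sketch: first quantize weights uniformly at a higher bit width, then re-quantize those integers down to a low bit width, folding the two scales into one dequantization factor. This is only an illustrative toy (the function name and staging are assumptions, not the actual GPTQT algorithm), assuming nonzero weights:

```python
def quantize_twice(weights, bits_stage1=8, bits_stage2=3):
    """Toy two-stage quantization sketch (NOT the actual GPTQT algorithm).

    Stage 1: uniform symmetric quantization at a higher bit width.
    Stage 2: re-quantize the stage-1 integers to a low bit width.
    Returns the low-bit integers and the combined dequantization scale.
    """
    qmax1 = 2 ** (bits_stage1 - 1) - 1
    s1 = max(abs(w) for w in weights) / qmax1
    q1 = [max(-qmax1 - 1, min(qmax1, round(w / s1))) for w in weights]

    qmax2 = 2 ** (bits_stage2 - 1) - 1
    s2 = max(abs(q) for q in q1) / qmax2
    q2 = [max(-qmax2 - 1, min(qmax2, round(q / s2))) for q in q1]

    # Approximate reconstruction: w ≈ q2[i] * (s1 * s2)
    return q2, s1 * s2

# Example: 3-bit codes stay in [-4, 3]; w_hat = q * scale recovers w roughly.
q, scale = quantize_twice([0.5, -1.0, 0.25, 0.75])
```

The benefit of the two-stage view is that inference only needs the low-bit codes and a single fused scale, while the intermediate higher-bit step preserves more precision during the quantization process itself.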
Abstract
Due to their large size, generative large language models (LLMs) require significant computing and storage resources. This paper introduces a new…