Recent studies have introduced effective compression techniques for Large Language Models (LLMs) via post-training quantization or low-bit weight representation. Although quantized weights offer storage efficiency and allow for faster inference, existing work indicates that quantization may compromise performance and exacerbate biases in LLMs. This study investigates the confidence and calibration of quantized models, considering language model type and scale as factors contributing to quantization loss. First, we show that 4-bit quantization with GPTQ decreases confidence in the true labels, with the size of the effect varying across language models. Second, we observe that the impact on confidence fluctuates across model scales. Finally, we propose an explanation for quantization loss based on confidence levels: quantization disproportionately affects samples on which the full model already exhibited low confidence.
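The quantities discussed above can be illustrated with a minimal sketch. It assumes that each model yields a per-sample probability distribution over answer options, and it uses expected calibration error (ECE) as one common calibration measure; the abstract does not specify which metrics were used, so the function names, the binning scheme, and the 0.5 threshold are all hypothetical choices made for illustration only.

```python
from typing import List, Sequence, Tuple


def true_label_confidence(probs: Sequence[Sequence[float]],
                          labels: Sequence[int]) -> List[float]:
    """Probability that the model assigns to the true label, per sample."""
    return [p[y] for p, y in zip(probs, labels)]


def expected_calibration_error(probs: Sequence[Sequence[float]],
                               labels: Sequence[int],
                               n_bins: int = 10) -> float:
    """ECE: bin samples by top-class confidence, then take the
    sample-weighted mean of |accuracy - mean confidence| over bins."""
    bins: List[List[Tuple[float, float]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        conf = max(p)
        pred = list(p).index(conf)
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes to last bin
        bins[idx].append((conf, 1.0 if pred == y else 0.0))
    n = len(labels)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(a for _, a in b) / len(b)
            ece += len(b) / n * abs(acc - avg_conf)
    return ece


def mean_drop_by_full_confidence(full_probs: Sequence[Sequence[float]],
                                 quant_probs: Sequence[Sequence[float]],
                                 labels: Sequence[int],
                                 threshold: float = 0.5) -> Tuple[float, float]:
    """Mean drop in true-label confidence after quantization, split into
    samples where the full model was below / at-or-above `threshold`.
    The threshold is an illustrative assumption, not taken from the paper."""
    full = true_label_confidence(full_probs, labels)
    quant = true_label_confidence(quant_probs, labels)
    low = [f - q for f, q in zip(full, quant) if f < threshold]
    high = [f - q for f, q in zip(full, quant) if f >= threshold]

    def mean(xs: List[float]) -> float:
        return sum(xs) / len(xs) if xs else 0.0

    return mean(low), mean(high)


if __name__ == "__main__":
    # Made-up probabilities for three two-option samples (not experimental data).
    full = [[0.9, 0.1], [0.4, 0.6], [0.3, 0.7]]
    quant = [[0.88, 0.12], [0.2, 0.8], [0.35, 0.65]]
    labels = [0, 0, 1]
    print(mean_drop_by_full_confidence(full, quant, labels))
    print(expected_calibration_error(full, labels),
          expected_calibration_error(quant, labels))
```

Comparing the two values returned by `mean_drop_by_full_confidence` is one simple way to check whether the confidence loss concentrates on samples where the full model was already uncertain.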

The Impact of Quantization on the Confidence of Large Language Models