Quantization has become a crucial step for the efficient deployment of deep neural networks, where floating point operations are converted to simpler fixed point operations. In its most naive form, it simply consists in a combination of scaling and rounding transformations, leading to either a limited compression rate or a significant accuracy drop. Recently, Gradient-based post-training quantization (GPTQ) methods appears to be constitute a suitable trade-off between such simple methods and more powerful, yet expensive Quantization-Aware Training (QAT) approaches, particularly when attempting to quantize LLMs, where scalability of the quantization process is of paramount importance. GPTQ essentially consists in learning the rounding operation using a small calibration set. In this work, we challenge common choices in GPTQ methods. In particular, we show that the process is, to a certain extent, robust to a number of variables (weight selection, feature augmentation, choice of calibration set). More importantly, we derive a number of best practices for designing more efficient and scalable GPTQ methods, regarding the problem formulation (loss, degrees of freedom, use of non-uniform quantization schemes) or optimization process (choice of variable and optimizer). Lastly, we propose a novel importance-based mixed-precision technique. Those guidelines lead to significant performance improvements on all the tested state-of-the-art GPTQ methods and networks (e.g. +6.819 points on ViT for 4-bit quantization), paving the way for the design of scalable, yet effective quantization methods.

量化方法在深度神经网络的高效部署中变得至关重要，深度神经网络经常需要量化以便在计算中使用固定点操作代替浮点操作。本文探讨了一种基于梯度的后训练量化方法（GPTQ），证明了该方法在选择权重、特征增强、校准集等方面具有一定鲁棒性，并提出了设计更高效、可扩展的GPTQ方法的准则，最后还提出了一种基于重要性的混合精度技术，这些准则和技术共同促进了已有的GPTQ方法和网络的性能改进，为设计可扩展且有效的量化方法开辟了新的可能。

基于梯度的训练后量化: 对现状的挑战