As large language models (LLMs) become more prevalent, there is a growing need for new and improved quantization methods that can meet the computational demands of these modern architectures while maintaining accuracy. In this paper, we present TEQ, a trainable equivalent transformation that preserves the FP32 precision of the model output while taking advantage of low-precision quantization, especially 3-bit and 4-bit weight-only quantization. The training process is lightweight, requiring only 1K steps and fewer than 0.1 percent of the original model's trainable parameters. Furthermore, the transformation does not add any computational overhead during inference. Our results are on par with the state-of-the-art (SOTA) methods on typical LLMs. Our approach can be combined with other methods to achieve even better performance. The code is available at https://github.com/intel/neural-compressor.

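This paper introduces a trainable equivalent transformation that preserves the FP32 precision of the model output while exploiting low-precision quantization, in particular 3-bit and 4-bit weight-only quantization, to meet the computational demands of modern architectures. The method is lightweight to train, adds no computational overhead at inference, achieves results comparable to current state-of-the-art methods, and can be combined with other methods for better performance.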
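
To make the idea of an "equivalent transformation" concrete, the following is a minimal NumPy sketch, not the TEQ implementation. It assumes the transformation takes the form of a per-input-channel scale `s` applied to the weights with the inverse scale applied to the activations, so that the FP32 output is unchanged before quantization. The scale values here are random placeholders (in TEQ they would be trained), and `quantize_weight_rtn` is a hypothetical round-to-nearest helper introduced only for illustration.

```python
import numpy as np


def quantize_weight_rtn(w, bits=4):
    """Hypothetical helper: asymmetric round-to-nearest quantize-dequantize,
    with one scale and offset per output channel (column of w)."""
    qmax = 2**bits - 1
    w_min = w.min(axis=0, keepdims=True)
    w_max = w.max(axis=0, keepdims=True)
    scale = np.maximum(w_max - w_min, 1e-8) / qmax
    q = np.clip(np.round((w - w_min) / scale), 0, qmax)
    return q * scale + w_min


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))    # activations: (tokens, in_features)
w = rng.standard_normal((16, 32))   # weights: (in_features, out_features)

# Assumption: per-input-channel scales. Random placeholders stand in for
# the values a training procedure would learn.
s = rng.uniform(0.5, 2.0, size=16)

# FP32 reference output and the transformed but mathematically equal output:
# (x / s) @ (diag(s) @ w) == x @ w
y_ref = x @ w
y_eq = (x / s) @ (w * s[:, None])

# Weight-only quantization is applied to the transformed weights.
w_q = quantize_weight_rtn(w * s[:, None], bits=4)
y_q = (x / s) @ w_q

# Relative error of the quantized output against the FP32 reference.
rel_err = np.linalg.norm(y_q - y_ref) / np.linalg.norm(y_ref)
print("equivalent before quantization:", np.allclose(y_ref, y_eq))
print("relative error after 4-bit weight quantization:", rel_err)
```

Because the activation-side factor `1 / s` is a fixed per-channel multiplier, it could in principle be folded into a preceding layer, which would be consistent with the claim of no added inference overhead; that folding step is an assumption here and is not shown.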

Trainable Equivalent Transformation for Quantization of LLMs