Large Language Models (LLMs) are proficient in natural language processing tasks, but their deployment is often restricted by extensive parameter sizes and computational demands. This paper focuses on post-training quantization (PTQ) in LLMs, specifically 4-bit weight and 8-bit activation (W4A8) quantization, to enhance computational efficiency -- a topic less explored compared to weight-only quantization. We present two innovative techniques: activation-quantization-aware scaling (AQAS) and sequence-length-aware calibration (SLAC) to enhance PTQ by considering the combined effects on weights and activations and aligning calibration sequence lengths to target tasks. Moreover, we introduce dINT, a hybrid data format combining integer and denormal representations, to address the underflow issue in W4A8 quantization, where small values are rounded to zero. Through rigorous evaluations of LLMs, including OPT and LLaMA, we demonstrate that our techniques significantly boost task accuracies to levels comparable with full-precision models. By developing arithmetic units compatible with dINT, we further confirm that our methods yield a 2$\times$ hardware efficiency improvement compared to 8-bit integer MAC unit.

该研究探讨了大型语言模型的后训练量化，特别是4位权重和8位激活（W4A8）量化，以提高计算效率，介绍了激活量化感知的缩放（AQAS）和序列长度感知的校准（SLAC）等创新技术，并引入了整数和非规格化表示的混合数据格式（dINT）来解决W4A8量化中的下溢问题，并通过对LLMs的严格评估证明这些技术显著提高了任务准确度，并且与完整精度模型相当，通过与dINT兼容的算术单元的开发，进一步证实了该方法相对于8位整数MAC单元可以提升2倍硬件效率。

通过权重和激活量化提升大型语言模型的计算效率