We propose LLM-FP4 for quantizing both weights and activations in large
language models (LLMs) down to 4-bit floating-point values, in a post-training
manner. Existing post-training quantization (PTQ) solutions are primarily
integer-based and struggle with bit widths below 8 bits. Compared to integer
quantization, floating-point (FP) quantization is more flexible and can better
handle long-tail or bell-shaped distributions, and it has emerged as a default
choice in many hardware platforms. One characteristic of FP quantization is
that its performance largely depends on the choice of exponent bits and
clipping range. In this regard, we construct a strong FP-PTQ baseline by
searching for the optimal quantization parameters. Furthermore, we observe a
pattern of high inter-channel variance and low intra-channel variance in
activation distributions, which makes activation quantization more difficult.
We find this pattern to be consistent across a spectrum of transformer models
designed for diverse tasks, such as LLMs, BERT, and Vision Transformer models.
To tackle this, we propose per-channel activation quantization and show that
these additional scaling factors can be reparameterized as exponential biases
of weights, incurring a negligible cost. Our method, for the first time, can
quantize both weights and activations in LLaMA-13B to only 4 bits and
achieves an average score of 63.1 on the common sense zero-shot reasoning
tasks, which is only 5.8 lower than the full-precision model, significantly
outperforming the previous state-of-the-art by 12.7 points. Code is available
at: this https URL
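
As a rough illustration of the search described in the abstract, the following
Python sketch enumerates the values that a 4-bit sign/exponent/mantissa format
can represent and grid-searches the number of exponent bits and the clipping
maximum. The function names, the candidate clipping grid, and the plain
tensor-level mean squared error objective are assumptions made for this
example. It is not the authors' implementation.

```python
import math


def fp_grid(exp_bits, man_bits, max_val):
    """Sorted non-negative magnitudes of a sign/exponent/mantissa format.

    The (real-valued) exponent bias is chosen so that the largest
    representable magnitude equals max_val. Requires exp_bits >= 1.
    """
    top = 2 ** exp_bits - 1
    bias = top - math.log2(max_val / (2 - 2 ** -man_bits))
    grid = set()
    for p in range(2 ** exp_bits):
        for d in range(2 ** man_bits):
            frac = d / 2 ** man_bits
            if p == 0:
                grid.add(2 ** (1 - bias) * frac)        # subnormal values
            else:
                grid.add(2 ** (p - bias) * (1 + frac))  # normal values
    return sorted(grid)


def fp_quantize(x, grid):
    """Clip |x| to the grid maximum, then round to the nearest grid value."""
    mag = min(abs(x), grid[-1])
    q = min(grid, key=lambda g: abs(g - mag))
    return math.copysign(q, x)


def search_format(values, total_bits=4, n_clip=20):
    """Grid-search exponent bits and clipping maximum by tensor-level MSE.

    Assumes at least one nonzero value. Returns a tuple
    (mse, exp_bits, man_bits, max_val) for the best candidate found.
    """
    peak = max(abs(v) for v in values)
    best = None
    for exp_bits in range(1, total_bits):
        man_bits = total_bits - 1 - exp_bits  # one bit is the sign
        for k in range(1, n_clip + 1):
            max_val = peak * (k / n_clip)
            grid = fp_grid(exp_bits, man_bits, max_val)
            mse = sum((v - fp_quantize(v, grid)) ** 2 for v in values) / len(values)
            if best is None or mse < best[0]:
                best = (mse, exp_bits, man_bits, max_val)
    return best
```

For example, `fp_grid(2, 1, 6.0)` yields the familiar E2M1 magnitudes
0, 0.5, 1, 1.5, 2, 3, 4, 6, and `search_format(values)` reports which split of
the three non-sign bits and which clipping maximum gave the lowest error on
`values`.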

We propose LLM-FP4, which quantizes both the weights and activations of large
language models (LLMs) to 4-bit floating-point values after training.

LLM-FP4: 4-Bit Floating-Point Quantized Transformers

For large language models (LLMs), balancing computational efficiency against
model quality is a formidable challenge. Uniform quantization has inherent
limitations, particularly when dealing with outliers. Motivated by these
limitations and by the launch of NVIDIA's H100 hardware, this study examines
the viability of floating-point (FP) quantization, focusing on FP8 and FP4, as
a potential solution. Our comprehensive investigation reveals that for LLMs,
FP8 activation quantization consistently outperforms its integer (INT8)
equivalent, with the performance edge becoming more noticeable in models with
more than one billion parameters.
For weight quantization, our findings indicate that FP4 exhibits comparable, if
not superior, performance to INT4, simplifying deployment on FP-supported
hardware like H100. To mitigate the overhead from precision alignment caused by
the disparity between weights and activations, we propose two scaling
constraints for weight quantization that have a negligible impact on performance
compared to the standard W4A8 model. We additionally enhance our quantization
methods by integrating the Low Rank Compensation (LoRC) strategy, yielding
improvements especially in smaller models. The results of our investigation
emphasize the immense potential of FP quantization for LLMs, paving the way for
high-efficiency deployment in resource-limited settings.
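
The abstract does not say what the two scaling constraints are. One constraint
of this general kind, assumed here purely for illustration, restricts each
weight group's scale to a power of two, so that rescaling a 4-bit weight only
changes its exponent and leaves the mantissa untouched. The Python sketch below
uses a common E2M1 layout for FP4, and the group handling and function names
are hypothetical rather than taken from the paper.

```python
import math

# Magnitudes of a common FP4 (E2M1) layout; assumed here for illustration.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def pow2_scale(values, grid_max=6.0):
    """Smallest power-of-two scale that keeps the largest magnitude in range."""
    peak = max(abs(v) for v in values)
    if peak == 0:
        return 1.0
    return 2.0 ** math.ceil(math.log2(peak / grid_max))


def quantize_group(values, scale):
    """Divide by the scale, clip, and round each value to the nearest FP4 code."""
    codes = []
    for v in values:
        mag = min(abs(v) / scale, E2M1_GRID[-1])
        q = min(E2M1_GRID, key=lambda g: abs(g - mag))
        codes.append(math.copysign(q, v))
    return codes


def dequantize(codes, scale):
    """Multiply codes by a power-of-two scale, which only shifts the exponent."""
    return [c * scale for c in codes]
```

With `weights = [0.1, -0.75, 0.3, 0.02]`, `pow2_scale(weights)` gives `0.125`,
and dequantizing the resulting codes is an exact exponent shift, which is the
property that would make alignment with higher-precision activations cheap
under this assumption.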

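Floating-point quantization, particularly FP8 and FP4, performs well in large
language models, with the performance advantage becoming more pronounced in
models with more than one billion parameters. For weight quantization, FP4
performs comparably to, or even better than, INT4, which simplifies deployment
on FP-supported hardware. To address the precision-alignment overhead caused by
the mismatch between weights and activations, the authors propose two scaling
constraints for weight quantization that have a negligible impact on
performance compared to the standard W4A8 model. They also integrate the Low
Rank Compensation (LoRC) strategy to enhance the quantization methods, which
helps especially in smaller models. The results highlight the great potential
of floating-point quantization for large language models, paving the way for
efficient deployment in resource-constrained settings.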