Large language models (LLMs) have demonstrated impressive abilities in
various domains, but their inference cost is high. State-of-the-art
methods use 2-bit quantization for mainstream LLMs. However, challenges still
exist: (1) Non-negligible accuracy loss for 2-bit quantization. Weights are
quantized in groups, and the weight range is large in some groups, resulting
in large quantization errors and non-negligible accuracy loss (e.g.,
>3% for Llama2-7b with 2-bit quantization in GPTQ and Greenbit). (2) Limited
accuracy improvement from adding 4-bit weights. Raising the average bit-width
by 10% through additional 4-bit weights yields only <0.5% accuracy improvement
on a quantized Llama2-7b. (3) Time-consuming dequantization operations on GPUs.
Dequantization accounts for >50% of execution time, limiting the potential
reduction in LLM inference cost. To tackle these challenges, we propose the
following techniques: (1) We quantize only the small fraction of groups with
larger ranges using 4 bits, taking GPU memory alignment into account. (2) We
observe that the distribution of sparse outliers with larger weights differs
between 2-bit and 4-bit groups, and that only a small fraction of outliers
require 16-bit quantization. This design yields >0.5% accuracy improvement
with a <3% increase in average bits for Llama2-7b. (3) We design asynchronous
dequantization on GPUs, achieving up to 3.92X speedup. We conduct extensive
experiments on different model families and model sizes. We achieve 2.85 bits
per weight and an end-to-end speedup of 1.74X for Llama2-7b over the
original model, and we reduce runtime cost and hardware cost by up to
2.70X and 2.81X, respectively, with lower GPU requirements.
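
The range-based choice between 2-bit and 4-bit groups can be illustrated with
a minimal sketch. It is not the implementation described here: the group size,
the 25% fraction of 4-bit groups, the synthetic weights, and all function
names are assumptions made for illustration, and the 16-bit outlier handling
and GPU memory alignment are omitted.

```python
import math

def quantize_group(weights, bits):
    """Uniform asymmetric quantization of one group; returns dequantized values."""
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1                      # 3 steps for 2-bit, 15 for 4-bit
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((w - lo) / scale) * scale for w in weights]

def mixed_precision_quantize(weights, group_size=16, frac_4bit=0.25):
    """Quantize groups to 2 bits, except the widest-range fraction, which gets 4 bits."""
    groups = [weights[i:i + group_size] for i in range(0, len(weights), group_size)]
    ranges = [max(g) - min(g) for g in groups]
    n_4bit = round(frac_4bit * len(groups))
    # Indices of the groups with the largest value range.
    widest = set(sorted(range(len(groups)), key=lambda i: ranges[i], reverse=True)[:n_4bit])
    bits = [4 if i in widest else 2 for i in range(len(groups))]
    dequant = []
    for g, b in zip(groups, bits):
        dequant.extend(quantize_group(g, b))
    # Average bits per weight, ignoring per-group scale and zero-point overhead.
    avg_bits = sum(b * len(g) for g, b in zip(groups, bits)) / len(weights)
    return dequant, bits, avg_bits

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Synthetic weights (hypothetical): 8 groups of 16, where groups 1 and 5 have a
# much larger range than the others.
GROUP_SIZE = 16
scales = [0.05, 1.0, 0.05, 0.05, 0.05, 1.0, 0.05, 0.05]
weights = [s * math.sin(0.7 * i + g)
           for g, s in enumerate(scales) for i in range(GROUP_SIZE)]

all_2bit, _, _ = mixed_precision_quantize(weights, GROUP_SIZE, frac_4bit=0.0)
dequant, bits, avg_bits = mixed_precision_quantize(weights, GROUP_SIZE, frac_4bit=0.25)

mse_2bit = mse(weights, all_2bit)
mse_mixed = mse(weights, dequant)
print("bits per group:", bits)
print("average bits per weight:", avg_bits)
print("MSE all 2-bit:", mse_2bit, "MSE mixed:", mse_mixed)
```

In this toy setting the wide-range groups dominate the 2-bit quantization
error, so assigning 4 bits only to them lowers the error while the remaining
groups stay at 2 bits.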

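To address, at low computational cost, the problems that arise when quantizing and dequantizing large language models (LLMs), we propose a new technique and conduct extensive experiments on different models and sizes. We achieve a 2.85-bit representation per weight and an end-to-end speedup of 1.74X, while reducing runtime cost and hardware requirements.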