Post-training quantization (PTQ) techniques applied to weights, activations,
and the KV cache greatly reduce memory usage, latency, and power consumption of
Large Language Models (LLMs), but may lead to large quantization errors when
outliers are present. Recent findings suggest that rotating activation or
weight matrices helps remove outliers and benefits quantization. In this work,
we identify a collection of applicable rotation parameterizations that lead to
identical outputs in full-precision Transformer architectures, and find that
some random rotations lead to much better quantization than others, with a
difference of up to 13 points in downstream zero-shot reasoning performance. As
a result, we propose SpinQuant, which optimizes (or learns) the rotation
matrices with Cayley optimization on a small validation set. With 4-bit
quantization of weights, activations, and the KV cache, SpinQuant narrows the
accuracy gap to full precision on zero-shot reasoning tasks to merely 2.9
points on the
LLaMA-2 7B model, surpassing LLM-QAT by 19.1 points and SmoothQuant by 25.0
points. SpinQuant also outperforms concurrent work QuaRot, which applies random
rotations to remove outliers. In particular, for LLaMA-2 7B/LLaMA-3 8B models
that are hard to quantize, SpinQuant reduces the gap to full precision by
30.2%/34.1% relative to QuaRot.
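
The two claims above, that a rotation leaves the full-precision output
unchanged and that it can shrink quantization error when outliers are present,
can be illustrated with a minimal NumPy sketch. Everything in it is an
assumption made for illustration and not taken from the paper: the toy sizes,
the single injected outlier channel, the symmetric per-row round-to-nearest
quantizer, and the helper names `random_orthogonal` and `fake_quant_per_row`.

```python
import numpy as np

def random_orthogonal(n, rng):
    # QR of a Gaussian matrix gives an orthogonal factor q.
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    # Flip column signs using the diagonal of r; q stays orthogonal.
    return q * np.sign(np.diag(r))

def fake_quant_per_row(x, bits=4):
    # Hypothetical quantizer: symmetric round-to-nearest with one scale per
    # row, followed by dequantization back to floating point.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=1, keepdims=True) / qmax
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

rng = np.random.default_rng(0)
d = 64
x = rng.standard_normal((32, d))   # toy activations
x[:, 3] += 20.0                    # inject one outlier channel
w = rng.standard_normal((d, d))    # toy weight matrix
r = random_orthogonal(d, rng)

# Full precision: rotating x by r and w by r.T cancels, since r @ r.T = I.
y_ref = x @ w
y_rot = (x @ r) @ (r.T @ w)
invariance_gap = np.abs(y_ref - y_rot).max()

# 4-bit error on the activations, before and after rotation.
err_plain = np.linalg.norm(fake_quant_per_row(x) - x)
err_rot = np.linalg.norm(fake_quant_per_row(x @ r) - x @ r)

print(f"invariance gap: {invariance_gap:.2e}")
print(f"quantization error, plain: {err_plain:.3f}, rotated: {err_rot:.3f}")
```

In this toy setting the rotation spreads the outlier channel across all
coordinates, so each row's quantization scale is no longer dominated by a
single entry. How much this helps depends on the particular rotation drawn,
which is the variability the abstract reports.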

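Optimizing rotation parameters in post-training quantization (PTQ) of large language models (LLMs) can significantly reduce memory usage, latency, and power consumption while also lowering quantization error. SpinQuant applies random rotations to the activation and weight matrices in LLMs and optimizes the rotation matrices to reduce quantization error, improving zero-shot reasoning performance over other methods, with especially large gains on models that are hard to quantize.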
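
As a rough illustration of how a rotation matrix can be optimized while staying
orthogonal, the sketch below applies a generic Cayley-transform update to a toy
objective. It is not the paper's training procedure: the loss (squared distance
to a hypothetical target rotation), the step size, the step count, and the
helper names `cayley_step` and `loss_and_grad` are all assumptions chosen for
the example.

```python
import numpy as np

def cayley_step(r, grad, lr):
    # Build a skew-symmetric direction y from the Euclidean gradient.
    m = grad @ r.T
    y = m - m.T
    eye = np.eye(r.shape[0])
    # (I + lr/2 * y)^{-1} (I - lr/2 * y) is orthogonal because y is
    # skew-symmetric, so the updated matrix remains orthogonal.
    return np.linalg.solve(eye + 0.5 * lr * y, (eye - 0.5 * lr * y) @ r)

def loss_and_grad(r, target):
    # Toy objective: 0.5 * ||r - target||_F^2, with gradient r - target.
    diff = r - target
    return 0.5 * np.sum(diff ** 2), diff

rng = np.random.default_rng(0)
n = 8
r0, _ = np.linalg.qr(rng.standard_normal((n, n)))      # initial rotation
target, _ = np.linalg.qr(rng.standard_normal((n, n)))  # hypothetical target

r = r0.copy()
loss_start, _ = loss_and_grad(r, target)
for _ in range(20):
    _, g = loss_and_grad(r, target)
    r = cayley_step(r, g, lr=0.1)
loss_end, _ = loss_and_grad(r, target)

# Orthogonality is preserved up to floating-point error.
orth_error = np.abs(r.T @ r - np.eye(n)).max()
print(f"loss: {loss_start:.4f} -> {loss_end:.4f}, orthogonality error: {orth_error:.2e}")
```

The point of the parameterization is that every iterate is itself a valid
rotation, so no separate re-orthogonalization step is needed after each update.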