We show that the majority of inference computations for large generative models such as LLaMA and OPT can be performed with both weights and activations cast to 4 bits, in a way that leads to practical speedups while maintaining good accuracy. We achieve this via a hybrid quantization strategy called QUIK, which compresses most of the weights and activations to 4 bits while keeping some outlier weights and activations in higher precision. Crucially, our scheme is designed with computational efficiency in mind: we provide GPU kernels with highly efficient layer-wise runtimes, which lead to practical end-to-end throughput improvements of up to 3.1x relative to FP16 execution. Code and models are provided at https://github.com/IST-DASLab/QUIK.
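
To make the idea of a hybrid scheme concrete, the following is a minimal NumPy sketch of a linear layer in which most input features are quantized to 4-bit integers while a few outlier features stay in full precision. It is an illustration under stated assumptions, not the QUIK kernels or the authors' algorithm: the function names (`quantize_int4`, `hybrid_matmul`), the symmetric round-to-nearest quantizer, the per-token and per-output-row scales, and the rule of picking outliers by largest activation magnitude are all hypothetical choices made for this example.

```python
import numpy as np


def quantize_int4(x, axis):
    """Symmetric round-to-nearest 4-bit quantization along `axis`.

    Returns integer codes in [-8, 7] (stored as int8) and the float scales
    needed to map the codes back to real values.
    """
    max_abs = np.max(np.abs(x), axis=axis, keepdims=True)
    # Avoid division by zero for all-zero rows.
    scale = np.where(max_abs > 0, max_abs / 7.0, 1.0)
    codes = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return codes, scale


def hybrid_matmul(x, w, num_outliers):
    """Approximate x @ w.T with a 4-bit base part plus a full-precision outlier part.

    x: activations of shape (tokens, in_features)
    w: weights of shape (out_features, in_features)
    num_outliers: number of input features kept in full precision
                  (assumption: chosen by largest activation magnitude)
    """
    in_features = x.shape[1]
    col_magnitude = np.max(np.abs(x), axis=0)
    # Indices of the `num_outliers` largest-magnitude input features.
    outlier_idx = np.argsort(col_magnitude)[in_features - num_outliers:]
    is_outlier = np.zeros(in_features, dtype=bool)
    is_outlier[outlier_idx] = True
    is_base = ~is_outlier

    # Base part: 4-bit codes, integer accumulation, then rescale.
    x_codes, x_scale = quantize_int4(x[:, is_base], axis=1)  # one scale per token
    w_codes, w_scale = quantize_int4(w[:, is_base], axis=1)  # one scale per output row
    acc = x_codes.astype(np.int32) @ w_codes.astype(np.int32).T
    base_out = acc * x_scale * w_scale.T

    # Outlier part: the selected features of both operands stay in full precision.
    outlier_out = x[:, is_outlier] @ w[:, is_outlier].T
    return base_out + outlier_out
```

Calling `hybrid_matmul(x, w, num_outliers=0)` gives plain 4-bit quantization of every feature, so comparing it with a small positive `num_outliers` on activations that contain a few large-magnitude features shows why keeping outliers in higher precision helps accuracy.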

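Most of the inference computation for large generative models can be accelerated by casting both weights and activations to 4 bits while maintaining good accuracy. We achieve this through a hybrid quantization strategy called QUIK, which compresses most weights and activations to 4 bits and keeps some outliers in higher precision. Crucially, our scheme focuses on computational efficiency and provides efficient layer-wise GPU kernels, improving end-to-end throughput by up to 3.1x relative to FP16 execution.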

Towards End-to-End 4-Bit Inference on Generative Large Language Models