BriefGPT.xyz
Nov, 2024
BitNet a4.8: 4-bit Activations for 1-bit LLMs
Hongyu Wang, Shuming Ma, Furu Wei
TL;DR
This work addresses the high inference cost and performance degradation of 1-bit large language models (LLMs). It introduces BitNet a4.8, which applies a hybrid quantization and sparsification strategy: 4-bit activations in the attention and feed-forward network layers, with sparsification of the intermediate states. Extensive experiments show that BitNet a4.8 achieves faster inference while matching the performance of BitNet b1.58, improving the efficiency of large LLMs.
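The paper's exact quantizers are not reproduced on this page; as a rough illustration of what "4-bit activations" means, the sketch below shows a generic per-tensor absmax quantizer that maps floating-point activations to the signed 4-bit range [-8, 7]. The function names and the per-tensor granularity are assumptions for illustration, not the paper's method.

```python
import numpy as np

def quant_int4_absmax(x: np.ndarray):
    """Illustrative per-tensor absmax quantization to the signed 4-bit range [-8, 7].

    Not BitNet a4.8's actual quantizer; a generic sketch of 4-bit
    activation quantization.
    """
    # Map the largest magnitude in x to 7; guard against an all-zero tensor.
    scale = max(np.max(np.abs(x)) / 7.0, 1e-12)
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

def dequant_int4(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate floating-point activations from 4-bit codes."""
    return q.astype(np.float32) * scale

# Example: quantize a small activation tensor and check the round-trip error.
rng = np.random.default_rng(0)
x = rng.standard_normal(16).astype(np.float32)
q, s = quant_int4_absmax(x)
x_hat = dequant_int4(q, s)
```

With absmax rounding, every in-range value is reconstructed to within half a quantization step (s / 2), which is the usual accuracy/efficiency trade-off that motivates pairing low-bit activations with sparsification of outlier-heavy intermediate states.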
Abstract
Recent research on 1-bit Large Language Models (LLMs), such as BitNet b1.58, presents a promising direction for reducing the inference cost of LLMs while maintaining their performance. In this work, we introduce BitNet a4.8, enabling …