Recent years have witnessed the rapid development of large language models (LLMs). Despite their strong ability in many language-understanding tasks, their heavy computational burden largely restricts their application, especially when one needs to deploy them onto edge devices. In this paper, we propose a quantization-aware low-rank adaptation (QA-LoRA) algorithm. The motivation lies in the imbalanced degrees of freedom of quantization and adaptation, and the solution is to use group-wise operators that increase the degrees of freedom of quantization while decreasing those of adaptation. QA-LoRA is easily implemented with a few lines of code, and it equips the original LoRA with two abilities: (i) during fine-tuning, the LLM's weights are quantized (e.g., into INT4) to reduce time and memory usage; (ii) after fine-tuning, the LLM and auxiliary weights are naturally integrated into a quantized model without loss of accuracy. We apply QA-LoRA to the LLaMA and LLaMA2 model families and validate its effectiveness on different fine-tuning datasets and downstream scenarios. Code will be made available at https://github.com/yuhuixu1993/qa-lora.
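The claim that the adapter can be folded into the quantized model "without loss of accuracy" can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the min-max INT4 quantizer, the sum pooling over input groups, and the shapes are all assumptions chosen for illustration. The idea shown is that if the low-rank update is constant within each input group, it can be absorbed into the per-group zero points, leaving the integer codes and scales untouched.

```python
import numpy as np

def quantize_groupwise(W, num_groups, bits=4):
    """Min-max quantize W (D_in x D_out) per input group and per output column.

    Returns integer codes q, scales alpha and zero points beta, with alpha and
    beta of shape (num_groups, D_out). (Assumed quantizer, for illustration.)
    """
    d_in, d_out = W.shape
    gs = d_in // num_groups                      # rows per group
    Wg = W.reshape(num_groups, gs, d_out)
    beta = Wg.min(axis=1)                        # zero point per (group, column)
    levels = 2 ** bits - 1
    alpha = (Wg.max(axis=1) - beta) / levels     # scale per (group, column)
    alpha = np.where(alpha == 0, 1.0, alpha)     # guard against constant groups
    q = np.rint((Wg - beta[:, None, :]) / alpha[:, None, :])
    q = np.clip(q, 0, levels).astype(np.int64)
    return q.reshape(d_in, d_out), alpha, beta

def dequantize(q, alpha, beta):
    """Reconstruct floating-point weights from codes, scales and zero points."""
    gs = q.shape[0] // alpha.shape[0]
    return np.repeat(alpha, gs, axis=0) * q + np.repeat(beta, gs, axis=0)

def forward_with_adapter(x, q, alpha, beta, A, B, s):
    """Quantized linear layer plus a group-wise low-rank adapter.

    x is (batch, D_in); A is (num_groups, r); B is (r, D_out).
    The adapter sees x summed within each input group (assumed pooling).
    """
    num_groups = A.shape[0]
    pooled = x.reshape(x.shape[0], num_groups, -1).sum(axis=2)
    return x @ dequantize(q, alpha, beta) + s * (pooled @ A @ B)

def merge_adapter(beta, A, B, s):
    """Absorb the adapter into the zero points; q and alpha stay unchanged."""
    return beta + s * (A @ B)

# Small demonstration with random data.
rng = np.random.default_rng(0)
d_in, d_out, num_groups, rank, s = 16, 8, 4, 2, 0.5
W = rng.normal(size=(d_in, d_out))
A = rng.normal(size=(num_groups, rank))
B = rng.normal(size=(rank, d_out))
x = rng.normal(size=(3, d_in))

q, alpha, beta = quantize_groupwise(W, num_groups)
y_adapter = forward_with_adapter(x, q, alpha, beta, A, B, s)
y_merged = x @ dequantize(q, alpha, merge_adapter(beta, A, B, s))
print(np.allclose(y_adapter, y_merged))
```

In this sketch, increasing `num_groups` gives the quantizer more scale and zero-point pairs per column, while the adapter's first factor shrinks from `D_in` rows to `num_groups` rows, which mirrors the trade-off in degrees of freedom described above.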

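We propose a quantization-aware low-rank adaptation algorithm (QA-LoRA). By using group-wise operators, it increases the degrees of freedom of quantization and decreases those of adaptation, quantizes the weights of large language models (LLMs) to reduce time and memory usage, and naturally integrates the LLM and auxiliary weights into a single quantized model without loss of accuracy. We apply QA-LoRA to the LLaMA and LLaMA2 model families and validate its effectiveness on different fine-tuning datasets and downstream scenarios.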

QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models