We propose a memory-efficient finetuning algorithm for large language models (LLMs) that supports finetuning LLMs with 65B parameters in 3-bit or 4-bit precision on as little as a single 48GB GPU. Our method, modular low-rank adaptation (ModuLoRA), integrates any user-specified weight quantizer with finetuning via low-rank adapters (LoRAs). Our approach relies on a simple quantization-agnostic backward pass that adaptively materializes low-precision LLM weights from a custom black-box quantization module. This approach enables finetuning 3-bit LLMs for the first time; finetuning with state-of-the-art 3-bit OPTQ quantization often outperforms finetuning that relies on less sophisticated 4-bit and 8-bit methods. In our experiments, ModuLoRA attains competitive performance on text classification, natural language inference, and instruction following tasks using significantly less memory than existing approaches, and it also surpasses the state-of-the-art ROUGE score on a popular summarization task. We release ModuLoRA together with a series of low-precision models, including the first family of 3-bit instruction following Alpaca LLMs, as part of LLMTOOLS, a user-friendly library for quantizing, running, and finetuning LLMs on consumer GPUs.
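To make the single-48GB-GPU claim concrete, the following back-of-the-envelope sketch estimates the memory taken by the weights of a 65B-parameter model at several precisions. The function name `weight_memory_gb` is illustrative and not part of LLMTOOLS. The estimate takes 1 GB to be 10^9 bytes and counts only the packed weights; it ignores quantization metadata (scales, zero points), activations, adapter parameters, and optimizer state, so real usage is higher.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # Compare 16-bit, 8-bit, 4-bit, and 3-bit storage for 65B parameters.
    for bits in (16, 8, 4, 3):
        print(f"{bits:>2}-bit: {weight_memory_gb(65e9, bits):.1f} GB")
```

Under these assumptions the 16-bit weights alone (130 GB) far exceed a 48GB GPU, while the 4-bit (32.5 GB) and 3-bit (about 24.4 GB) weights leave headroom for activations and adapters.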

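We propose ModuLoRA, a memory-efficient finetuning algorithm for large language models that supports finetuning LLMs with 65B parameters in 3-bit or 4-bit precision on a single 48GB GPU. By combining any user-specified weight quantizer with low-rank adapters (LoRAs), our method uses a simple quantization-agnostic backward pass that adaptively materializes low-precision LLM weights from a custom black-box quantization module. In our experiments, ModuLoRA attains competitive performance on text classification, natural language inference, and instruction following tasks while using less memory than existing approaches, and it also surpasses the state-of-the-art ROUGE score on a popular summarization task. We release ModuLoRA together with a series of low-precision models, including the first family of 3-bit instruction following Alpaca LLMs, as part of LLMTOOLS, a user-friendly library for quantizing, running, and finetuning LLMs on consumer GPUs.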
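The idea of a quantization-agnostic backward pass can be illustrated with a minimal pure-Python sketch. It is not the ModuLoRA implementation: `ToyQuantizer` is a hypothetical round-to-nearest quantizer standing in for a real method such as OPTQ, and `QuantLoRALinear` is an illustrative name. The layer computes y = (W_hat + B A) x, sees the quantizer only through a `dequantize()` callable, and re-materializes W_hat in both passes instead of storing it. Only the adapters A and B receive gradients.

```python
from typing import Callable, List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Compute m @ v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matvec_t(m: Matrix, v: Vector) -> Vector:
    """Compute m^T @ v without building the transpose."""
    return [sum(m[i][j] * v[i] for i in range(len(m))) for j in range(len(m[0]))]


def outer(u: Vector, v: Vector) -> Matrix:
    """Compute the outer product u v^T."""
    return [[a * b for b in v] for a in u]


class ToyQuantizer:
    """Hypothetical black-box quantizer: uniform round-to-nearest grid."""

    def __init__(self, weights: Matrix, bits: int = 3) -> None:
        flat = [w for row in weights for w in row]
        self.lo = min(flat)
        # Fall back to a scale of 1.0 when all weights are equal.
        self.scale = (max(flat) - self.lo) / (2 ** bits - 1) or 1.0
        self.codes = [[round((w - self.lo) / self.scale) for w in row]
                      for row in weights]

    def dequantize(self) -> Matrix:
        return [[c * self.scale + self.lo for c in row] for row in self.codes]


class QuantLoRALinear:
    """Frozen quantized weight plus trainable low-rank adapters A and B."""

    def __init__(self, dequantize: Callable[[], Matrix], a: Matrix, b: Matrix) -> None:
        self.dequantize = dequantize  # the only interface to the quantizer
        self.a, self.b = a, b         # a: r x d_in, b: d_out x r

    def forward(self, x: Vector) -> Vector:
        self.x, self.ax = x, matvec(self.a, x)
        w_hat = self.dequantize()     # materialized transiently, not stored
        return [p + q for p, q in zip(matvec(w_hat, x), matvec(self.b, self.ax))]

    def backward(self, grad_y: Vector) -> Tuple[Vector, Matrix, Matrix]:
        w_hat = self.dequantize()     # re-materialized for the backward pass
        bt_g = matvec_t(self.b, grad_y)                   # B^T g
        grad_x = [p + q for p, q in zip(matvec_t(w_hat, grad_y),
                                        matvec_t(self.a, bt_g))]
        grad_a = outer(bt_g, self.x)                      # (B^T g) x^T
        grad_b = outer(grad_y, self.ax)                   # g (A x)^T
        return grad_x, grad_a, grad_b


if __name__ == "__main__":
    quantizer = ToyQuantizer([[0.5, -1.0], [2.0, 0.0]], bits=3)
    layer = QuantLoRALinear(quantizer.dequantize, a=[[0.1, 0.2]], b=[[0.3], [-0.4]])
    print(layer.forward([1.0, 2.0]))
    print(layer.backward([1.0, 1.0]))
```

Because the layer depends only on `dequantize()`, swapping in a different quantizer requires no change to the forward or backward logic, which is the modularity the abstract describes.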

ModuLoRA: Finetuning 3-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers