Zeroth-order (ZO) optimization is a memory-efficient strategy for fine-tuning large language models (LLMs) using only forward passes. However, applying ZO fine-tuning in memory-constrained settings such as mobile phones and laptops remains challenging, since full-precision forward passes are infeasible there. In this study, we address this limitation by integrating sparsity and quantization into ZO fine-tuning of LLMs. Specifically, we investigate the feasibility of fine-tuning an extremely small subset of LLM parameters using ZO. This approach allows the majority of untuned parameters to be quantized to fit within limited device memory. Our findings reveal that the pre-training process can identify a set of "sensitive parameters" that can guide the ZO fine-tuning of LLMs on downstream tasks. Our results demonstrate that fine-tuning only the 0.1% of LLM parameters identified as sensitive with ZO can outperform full ZO fine-tuning, while also offering a wall-clock speedup. Additionally, we show that ZO fine-tuning targeting these 0.1% sensitive parameters, combined with 4-bit quantization, enables efficient ZO fine-tuning of a Llama2-7B model on a GPU with less than 8 GiB of memory and notably reduced latency.
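To make the idea of sparse ZO fine-tuning concrete, the following is a minimal NumPy sketch of a two-point zeroth-order SGD step restricted to a fixed mask. It is an illustration under stated assumptions, not the implementation behind the results above: the loss is a toy quadratic, the sensitivity score is assumed to be the squared gradient of that toy loss (a stand-in for scores obtained from pre-training), and the names `sparse_zo_step` and `top_k_mask` are hypothetical.

```python
import numpy as np

def top_k_mask(scores, fraction):
    """Boolean mask selecting the `fraction` of entries with the largest scores."""
    k = max(1, int(round(fraction * scores.size)))
    idx = np.argpartition(scores.ravel(), -k)[-k:]
    mask = np.zeros(scores.size, dtype=bool)
    mask[idx] = True
    return mask.reshape(scores.shape)

def sparse_zo_step(loss_fn, theta, mask, lr=0.05, eps=1e-3, rng=None):
    """One two-point zeroth-order SGD step that only moves entries where mask is True."""
    rng = np.random.default_rng() if rng is None else rng
    # Random Gaussian direction, zeroed outside the trainable subset.
    z = rng.standard_normal(theta.shape) * mask
    # Two forward evaluations of the loss; no backpropagation is used.
    loss_plus = loss_fn(theta + eps * z)
    loss_minus = loss_fn(theta - eps * z)
    # Finite-difference estimate of the directional derivative along z.
    g = (loss_plus - loss_minus) / (2.0 * eps)
    return theta - lr * g * z

# Toy setup (assumption): quadratic loss around a random target.
rng = np.random.default_rng(0)
target = rng.standard_normal(1000)
theta0 = np.zeros(1000)

def toy_loss(t):
    return 0.5 * float(np.sum((t - target) ** 2))

# Assumed sensitivity score: squared gradient of the toy loss at theta0.
mask = top_k_mask((theta0 - target) ** 2, fraction=0.01)

theta = theta0.copy()
for _ in range(300):
    theta = sparse_zo_step(toy_loss, theta, mask, rng=rng)

print("trainable entries:", int(mask.sum()))
print("loss before:", toy_loss(theta0), "loss after:", toy_loss(theta))
```

Because the perturbation is zero outside the mask, the untouched entries never change, which is what would allow them to be held in a quantized, read-only form in a real model.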

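This study integrates sparsity and quantization into zeroth-order (ZO) fine-tuning of large language models (LLMs), addressing the challenge of applying ZO fine-tuning in memory-constrained settings such as mobile phones and laptops. The results show that ZO fine-tuning of only the 0.1% sensitive parameters of an LLM can outperform full ZO fine-tuning while also running faster. Moreover, combined with 4-bit quantization, this enables efficient ZO fine-tuning of a Llama2-7B model on a GPU with less than 8 GiB of memory and notably reduced latency.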
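As a rough plausibility check on the memory figure, the weight storage can be estimated with back-of-the-envelope arithmetic. The sketch below assumes a round 7 billion parameters for Llama2-7B and assumes the 0.1% trainable parameters are kept in 16-bit precision; it ignores quantization metadata (scales, zero points), sparse index storage, activations, and any cache, so it is a lower bound on weight memory only. The function name `weight_memory_gib` is hypothetical.

```python
GIB = 1024 ** 3  # bytes per GiB

def weight_memory_gib(n_params, sparse_fraction=0.001, dense_bits=4, sparse_bits=16):
    """Return (quantized_gib, trainable_gib) for a model split into a
    quantized frozen part and a small higher-precision trainable part."""
    n_sparse = int(n_params * sparse_fraction)
    n_dense = n_params - n_sparse
    dense_bytes = n_dense * dense_bits / 8
    sparse_bytes = n_sparse * sparse_bits / 8
    return dense_bytes / GIB, sparse_bytes / GIB

# Assumption: about 7e9 parameters as a round number for a 7B model.
quantized_gib, trainable_gib = weight_memory_gib(7_000_000_000)
full_fp16_gib = 7_000_000_000 * 2 / GIB

print(f"4-bit frozen weights:     {quantized_gib:.2f} GiB")
print(f"16-bit trainable weights: {trainable_gib:.3f} GiB")
print(f"all weights in 16-bit:    {full_fp16_gib:.2f} GiB")
```

Under these assumptions the weights take well under half of an 8 GiB budget, whereas storing every weight in 16-bit would already exceed it.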

Zeroth-Order Fine-Tuning of LLMs with Extreme Sparsity