Large Language Models (LLMs) have demonstrated exceptional performance in automating various tasks, such as text generation and summarization. Currently LLMs are trained and fine-tuned on large cloud server. Deploying and fine-tuning these models on resource-constrained edge devices remains a significant challenge due to their substantial memory and computational requirements. This paper introduces a resource-efficient zeroth-order optimization approach that lowers the barriers for fine-tuning LLMs in such constrained environments. Our method features a parallelized randomized gradient estimation (P-RGE) technique, which performs gradient estimation with high parallel efficiency. P-RGE leverages outer-loop and inner-loop parallelization to perform multiple function queries and forward passes in parallel, reducing the wall-clock end-to-end training time. By integrating this technique with parameter-efficient fine-tuning methods (e.g., LoRA) and on-device inference engines (e.g., ExecuTorch), we demonstrate efficient fine-tuning of LLMs on both server-side and edge devices. Experiments show that P-RGE achieves significant runtime speedups and memory savings while maintaining fine-tuning accuracy, which paves the way for more practical deployment of LLMs in real-time, on-device applications.

本研究解决了在资源受限的边缘设备上微调大型语言模型（LLMs）的难题。论文提出了一种资源高效的零阶优化方法，并引入了并行随机梯度估计（P-RGE）技术，显著降低了微调所需的时间和内存消耗。实验结果表明，该方法在保留微调精度的同时，提升了运行速度，推动了LLMs在实时设备端应用中的实际部署。 

利用仅有推理引擎实现资源高效的设备端小型语言模型微调