Deploying Large Language Models (LLMs) on resource-constrained (or weak) devices is challenging because of limited resources and heterogeneous data distributions. To address the data concern, LLMs must be fine-tuned on on-device private data for various downstream tasks. While Federated Learning (FL) offers a promising privacy-preserving solution, existing fine-tuning methods retain the original LLM size, leaving high inference latency and excessive memory demands unresolved. Hence, we design FedSpine, an FL framework that combines Parameter-Efficient Fine-Tuning (PEFT) with structured pruning for efficient deployment of LLMs on resource-constrained devices. Specifically, FedSpine introduces an iterative process that alternately prunes and tunes the parameters of LLMs. To mitigate the impact of device heterogeneity, an online Multi-Armed Bandit (MAB) algorithm adaptively determines pruning ratios and LoRA ranks for heterogeneous devices without any prior knowledge of their computing and communication capabilities. As a result, FedSpine maintains higher inference accuracy while improving fine-tuning efficiency. Experiments on a physical platform with 80 devices demonstrate that, compared to other baselines, FedSpine speeds up fine-tuning by 1.4$\times$-6.9$\times$ and improves final accuracy by 0.4%-4.5% under the same sparsity level.
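The abstract does not say which bandit algorithm FedSpine uses or how its reward is defined, so the following is only a minimal sketch of the general idea: treating each (pruning ratio, LoRA rank) pair as an arm and choosing among them online. It uses UCB1, a standard MAB algorithm, as a stand-in. The class `UCB1Selector`, the arm list `ARMS`, and the reward table `MADE_UP_REWARD` are all hypothetical and exist purely for illustration; they are not taken from the paper.

```python
import math


class UCB1Selector:
    """UCB1 bandit over hypothetical (pruning_ratio, lora_rank) arms."""

    def __init__(self, arms):
        self.arms = list(arms)
        self.counts = [0] * len(self.arms)   # times each arm was played
        self.means = [0.0] * len(self.arms)  # running mean reward per arm
        self.total = 0                       # total number of plays

    def select(self):
        # Play every arm once before applying the UCB rule.
        for i, count in enumerate(self.counts):
            if count == 0:
                return i
        # Score = empirical mean + exploration bonus (UCB1).
        scores = [
            mean + math.sqrt(2.0 * math.log(self.total) / count)
            for mean, count in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, index, reward):
        # Incremental update of the running mean; reward assumed in [0, 1].
        self.counts[index] += 1
        self.total += 1
        self.means[index] += (reward - self.means[index]) / self.counts[index]


# Hypothetical arms: (pruning_ratio, lora_rank). Not values from the paper.
ARMS = [(0.2, 16), (0.4, 8), (0.6, 8), (0.8, 4)]

# Made-up, noise-free rewards standing in for whatever per-round feedback
# a real system would measure on a device.
MADE_UP_REWARD = {(0.2, 16): 0.3, (0.4, 8): 0.9, (0.6, 8): 0.5, (0.8, 4): 0.2}


def run_simulation(rounds):
    """Run the selector for a number of rounds against the made-up rewards."""
    selector = UCB1Selector(ARMS)
    for _ in range(rounds):
        i = selector.select()
        selector.update(i, MADE_UP_REWARD[selector.arms[i]])
    return selector


selector = run_simulation(500)
most_played = selector.arms[
    max(range(len(ARMS)), key=selector.counts.__getitem__)
]
print("most-played (pruning_ratio, lora_rank):", most_played)
```

In this sketch no device capability is supplied up front: the selector only sees the reward returned after each round, which mirrors the "no prior knowledge" property described above. How a real reward would trade off accuracy against fine-tuning time is left open here.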

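This work targets the high inference latency and excessive memory demands that arise when deploying large language models on resource-constrained devices, and proposes a new federated learning framework, FedSpine. By combining parameter-efficient fine-tuning with structured pruning, FedSpine adaptively adjusts pruning ratios and LoRA ranks without knowing the devices' computing and communication capabilities, improving both fine-tuning efficiency and inference accuracy. Experimental results show that, compared with other baselines, FedSpine speeds up fine-tuning by 1.4 to 6.9 times and improves final accuracy by 0.4% to 4.5% at the same sparsity level.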

Efficient Deployment of Large Language Models on Resource-Constrained Devices