Large pre-trained models (LPMs), such as LLaMA and ViT-G, have shown exceptional performance across various tasks. Although parameter-efficient fine-tuning (PEFT) has emerged as a way to cheaply fine-tune these large models on downstream tasks, deploying the fine-tuned models is still hindered by their vast scale and computational cost. Neural network pruning offers a solution for model compression by removing redundant parameters, but most existing methods rely on computing the gradients of the pre-trained parameters. Obtaining these gradients is computationally prohibitive for LPMs, which motivates the search for alternative approaches. To this end, we propose LoRAPrune, a unified framework for efficient fine-tuning and deployment of LPMs. We first design a PEFT-aware pruning criterion that estimates importance from the values and gradients of Low-Rank Adaptation (LoRA), rather than from the gradients of the pre-trained parameters. We then propose an iterative pruning procedure that removes redundant parameters while maximizing the advantages of PEFT. LoRAPrune thus delivers an accurate, compact model for efficient inference in a highly cost-effective manner. Experimental results on various tasks demonstrate that our method achieves state-of-the-art results. For instance, on the VTAB-1k benchmark, LoRAPrune uses only 0.76% of the trainable parameters and outperforms magnitude and movement pruning by a significant margin, achieving a mean Top-1 accuracy that is 5.7% and 4.3% higher, respectively. Moreover, our approach achieves performance comparable to PEFT methods, delivering high-quality results while retaining the benefits of pruning.
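
To make the idea of a PEFT-aware criterion concrete, here is a minimal sketch of scoring weights from LoRA quantities alone. The abstract does not give the formula, so everything here is an assumption for illustration: the merged weight is taken to be W + BA, the gradient of W is estimated by the first-order term grad_B·A + B·grad_A (the direction in which BA moves under a small gradient step on B and A), and importance is the magnitude of weight times estimated gradient. The function names `lora_importance` and `prune_mask` are hypothetical, and the mask is unstructured for simplicity.

```python
# Sketch only: the scoring rule below is an assumed first-order estimate,
# not a formula taken from the abstract.

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def lora_importance(W, B, A, grad_B, grad_A):
    """Score each weight as |(W + BA) * G_hat|, G_hat = grad_B @ A + B @ grad_A.

    Only the gradients of the small LoRA factors B and A are needed;
    the gradient of the frozen pre-trained matrix W is never computed.
    """
    merged = add(W, matmul(B, A))
    g_hat = add(matmul(grad_B, A), matmul(B, grad_A))
    return [[abs(w * g) for w, g in zip(rw, rg)] for rw, rg in zip(merged, g_hat)]


def prune_mask(scores, sparsity):
    """Return a 0/1 mask that drops the lowest-scoring fraction of entries."""
    cols = len(scores[0])
    order = sorted(range(len(scores) * cols),
                   key=lambda i: scores[i // cols][i % cols])
    pruned = set(order[: int(len(order) * sparsity)])
    return [[0 if r * cols + c in pruned else 1 for c in range(cols)]
            for r in range(len(scores))]


# Toy example: a 2x2 frozen weight with a rank-1 LoRA update.
W = [[1.0, -2.0], [0.5, 3.0]]
B = [[1.0], [0.0]]           # 2 x r, with r = 1
A = [[0.0, 1.0]]             # r x 2
grad_B = [[0.5], [-1.0]]     # made-up gradient values
grad_A = [[2.0, 0.0]]

scores = lora_importance(W, B, A, grad_B, grad_A)
mask = prune_mask(scores, sparsity=0.5)
print(scores)
print(mask)
```

An iterative procedure in the spirit of the abstract could repeat this scoring and masking between fine-tuning steps while raising the target sparsity gradually; the schedule is left out here because the abstract does not specify one.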

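This paper proposes LoRAPrune, a unified framework for efficient fine-tuning and deployment of high-performing large pre-trained models. It uses a PEFT-aware pruning criterion that estimates importance from the values and gradients of Low-Rank Adaptation (LoRA), and removes redundant parameters through an iterative pruning procedure that maximizes the advantages of PEFT, achieving both high accuracy and a high compression ratio. Experimental results show that the method achieves state-of-the-art results across tasks. On the VTAB-1k benchmark, using only 0.76% of the trainable parameters, it reaches a mean Top-1 accuracy 5.7% and 4.3% higher than magnitude and movement pruning, respectively, and delivers performance comparable to PEFT methods while retaining the benefits of fine-tuning.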

Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning