Despite the remarkable success of Large Language Models (LLMs), their massive
size poses significant deployment challenges, particularly on
resource-constrained hardware. While existing LLM compression methods focus on
quantization, pruning remains relatively unexplored due to the high cost of
training-based approaches and data collection challenges. One-shot pruning
methods, which are cost-effective and data-free, have become dominant in LLM
pruning, but they degrade performance under the structured pruning setting.
In this work, we introduce a new paradigm for structurally pruning LLMs, called
Compresso. Our approach, through the collaboration of the proposed
resource-efficient pruning algorithm and the LLM itself, learns optimal pruning
decisions during the training process. Compresso addresses the challenges of
expensive training costs and data collection by incorporating Low-Rank
Adaptation (LoRA) into the $L_0$ regularization during the instruction tuning
process. We further augment the pruning algorithm with a collaborative prompt
that fosters collaboration between the LLM and the pruning algorithm,
significantly boosting overall performance. As a result, Compresso prunes
LLaMA-7B to 5.4B parameters, maintaining original performance and even
surpassing LLaMA-7B in reading comprehension by 2.62%. Extensive experiments
demonstrate that Compresso significantly outperforms one-shot pruning baselines
across various sparsity ratios, achieving up to 2.21%, 11.43%, 7.04%, and 4.81%
higher scores on the commonsense reasoning, reading comprehension, MMLU, and
BBH benchmarks, respectively.
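
As a rough illustration of how an $L_0$ penalty can be made trainable, the
sketch below uses the hard-concrete gate relaxation, a common choice for
differentiable $L_0$ regularization. The abstract does not state which
relaxation Compresso uses, so the gate form, the constants `GAMMA`, `ZETA`,
and `BETA`, and the names `sample_gate` and `expected_l0` are all assumptions
made for illustration. LoRA and the instruction-tuning loop are omitted.

```python
import math
import random

# Assumed stretch limits and temperature for the hard-concrete gate.
# These are typical illustrative values, not values taken from the abstract.
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_gate(log_alpha, rng):
    """Sample one gate value in [0, 1] for a prunable structure."""
    u = rng.uniform(1e-6, 1.0 - 1e-6)            # uniform noise, kept away from 0 and 1
    s = sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / BETA)
    s_bar = s * (ZETA - GAMMA) + GAMMA           # stretch to (GAMMA, ZETA)
    return min(1.0, max(0.0, s_bar))             # clip so exact 0 and 1 are reachable


def expected_l0(log_alphas):
    """Expected number of non-zero gates; smooth in each log_alpha."""
    shift = BETA * math.log(-GAMMA / ZETA)
    return sum(sigmoid(a - shift) for a in log_alphas)


rng = random.Random(0)
log_alphas = [3.0, 0.0, -3.0]   # hypothetical learnable parameters, one per structure
gates = [sample_gate(a, rng) for a in log_alphas]
penalty = expected_l0(log_alphas)
```

In a setup like this, each gate would multiply the output of one prunable
structure (for example an attention head), and the expected count would be
driven toward a target size by an extra term in the training loss.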

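Through collaboration between the pruning algorithm and the LLM itself,
Compresso addresses the challenges of costly data collection and training. It
learns optimal pruning decisions during training and further strengthens the
pruning algorithm with a collaborative prompt. Compresso prunes LLaMA-7B to
5.4B and surpasses LLaMA-7B in reading comprehension by 2.62%. It clearly
outperforms one-shot pruning baselines, scoring up to 2.21%, 11.43%, 7.04%,
and 4.81% higher on the commonsense reasoning, reading comprehension, MMLU,
and BBH benchmarks, respectively.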

Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models

Random pruning is arguably the most naive way to attain sparsity in neural
networks, but it has been deemed uncompetitive with both post-training pruning
and sparse training. In this paper, we focus on sparse training and highlight a
perhaps counter-intuitive finding: random pruning at initialization can be
quite powerful for the sparse training of modern neural networks. Without any
delicate pruning criteria or carefully pursued sparsity structures, we
empirically demonstrate that sparsely training a randomly pruned network from
scratch can match the performance of its dense equivalent. Two key factors
contribute to this revival: (i) network size matters: as the original dense
networks grow wider and deeper, the performance of a randomly pruned sparse
network quickly grows to match that of its dense equivalent, even at high
sparsity ratios; (ii) appropriate layer-wise sparsity ratios can be pre-chosen
for sparse training, which proves to be another important performance booster.
Simple as it looks, a randomly pruned subnetwork of Wide ResNet-50 can be
sparsely trained to outperform a dense Wide ResNet-50 on ImageNet. We also
observe that such randomly pruned networks outperform their dense counterparts
in other favorable aspects, such as
out-of-distribution detection, uncertainty estimation, and adversarial
robustness. Overall, our results strongly suggest that there is
larger-than-expected room for sparse training at scale, and that the benefits
of sparsity might be more universal, extending beyond carefully designed
pruning. Our source code can be found at
this https URL.
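
A minimal sketch of random pruning at initialization with pre-chosen
layer-wise sparsity ratios might look as follows. The layer names, shapes, and
ratios are hypothetical, and the helper names are not taken from the paper.
Only the mask construction is shown; the training loop is left out.

```python
import random


def random_mask(shape, sparsity, rng):
    """Return a 0/1 mask with round(sparsity * n) zeros placed uniformly at random."""
    rows, cols = shape
    n = rows * cols
    n_pruned = round(sparsity * n)
    pruned = set(rng.sample(range(n), n_pruned))   # flat indices of pruned weights
    return [[0 if r * cols + c in pruned else 1 for c in range(cols)]
            for r in range(rows)]


def prune_at_init(layer_shapes, layer_sparsity, seed=0):
    """Draw one random mask per layer using that layer's pre-chosen sparsity."""
    rng = random.Random(seed)
    return {name: random_mask(shape, layer_sparsity[name], rng)
            for name, shape in layer_shapes.items()}


def density(mask):
    """Fraction of weights kept by a mask."""
    return sum(map(sum, mask)) / (len(mask) * len(mask[0]))


# Hypothetical layers and ratios, chosen only for illustration.
layer_shapes = {"fc1": (8, 16), "fc2": (16, 16), "fc3": (16, 4)}
layer_sparsity = {"fc1": 0.5, "fc2": 0.75, "fc3": 0.25}
masks = prune_at_init(layer_shapes, layer_sparsity)
```

In sparse training of this kind, the masks would be drawn once before training
and kept fixed, with pruned weights held at zero while the remaining weights
are trained from scratch.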

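This study examines sparse training in modern neural networks and shows that
random pruning at initialization can be effective for it. The results indicate
that a randomly pruned network trained this way can match its dense
counterpart, and that choosing appropriate layer-wise sparsity ratios improves
performance further.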