Pruning large language models (LLMs) from the BERT family has emerged as a standard compression benchmark, and several pruning methods have been proposed for this task. The recent ``Sparsity May Cry'' (SMC) benchmark called into question the validity of all existing methods, presenting a more complex setup in which many known pruning methods appear to fail. We revisit the question of accurate BERT pruning during fine-tuning on downstream datasets, and propose a set of general guidelines for successful pruning, even on the challenging SMC benchmark. First, we perform a cost-vs-benefit analysis of pruning model components, such as the embeddings and the classification head; second, we provide a simple-yet-general way of scaling training, sparsification, and learning rate schedules relative to the desired target sparsity; finally, we investigate the importance of proper parametrization for Knowledge Distillation in the context of LLMs. Our simple insights lead to state-of-the-art results on both classic BERT-pruning benchmarks and the SMC benchmark, showing that even classic gradual magnitude pruning (GMP) can yield competitive results with the right approach.

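For pruning BERT models, we propose a set of general guidelines for successful pruning, including a simple way of scaling training, sparsification, and learning rate schedules relative to the target sparsity, and the importance of proper parametrization when applying knowledge distillation to LLMs. These simple insights lead to state-of-the-art results on both classic BERT-pruning benchmarks and the SMC benchmark, showing that even classic gradual magnitude pruning can yield competitive results with the right approach.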
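
As a rough illustration of what gradual magnitude pruning with a sparsity schedule tied to the target sparsity can look like, the following is a minimal sketch. It is not the authors' implementation: the cubic schedule, the function names `cubic_sparsity` and `magnitude_mask`, the pruning interval, and the random weight matrix are all assumptions chosen for illustration, and the fine-tuning updates that would normally run between pruning steps are omitted.

```python
import numpy as np


def cubic_sparsity(step, total_steps, target, initial=0.0):
    """Assumed cubic schedule: sparsity rises quickly at first, then flattens at `target`."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return target + (initial - target) * (1.0 - progress) ** 3


def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that removes the `sparsity` fraction of smallest-magnitude weights."""
    k = int(round(sparsity * weights.size))
    if k == 0:
        return np.ones_like(weights)
    # Indices of the k entries with the smallest absolute value.
    idx = np.argpartition(np.abs(weights).ravel(), k - 1)[:k]
    mask = np.ones(weights.size, dtype=weights.dtype)
    mask[idx] = 0
    return mask.reshape(weights.shape)


# Hypothetical setup: one 64x64 weight matrix pruned to 90% sparsity.
rng = np.random.default_rng(0)
weights = rng.normal(size=(64, 64))
target_sparsity = 0.9
total_steps = 100
prune_every = 10  # assumed pruning interval

schedule = []
for step in range(0, total_steps + 1, prune_every):
    s = cubic_sparsity(step, total_steps, target_sparsity)
    schedule.append(s)
    # Already-pruned weights are zero, so they are re-selected first and stay pruned.
    mask = magnitude_mask(weights, s)
    weights = weights * mask
    # In real fine-tuning, gradient updates on the surviving weights would happen here.

final_sparsity = float(np.mean(weights == 0))
print(f"final sparsity: {final_sparsity:.4f}")
```

Because the schedule is parametrized by the target, changing `target_sparsity` rescales every intermediate pruning step; a real setup would presumably scale the training length and learning rate schedule alongside it, as the guidelines above suggest.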

Pruning Language Models: Recovering Accuracy on the ``Sparsity May Cry'' Benchmark