Compared with moderately sized neural network models, structural weight
pruning of Large Language Models (LLMs) poses a new challenge for the
efficiency of pruning algorithms, owing to the heavy computation and memory
demands of LLMs. Recent efficient LLM pruning methods typically operate in
the post-training phase without expensive weight finetuning; however, their
pruning criteria often rely on heuristically designed metrics, potentially
leading to suboptimal performance. We instead propose a novel
optimization-based structural pruning that learns the pruning masks in a
probabilistic space directly by optimizing the loss of the pruned model. To
preserve efficiency, our method 1) works in the post-training phase and 2)
eliminates the back-propagation through the LLM per se during the optimization
(i.e., only requires the forward pass of the LLM). We achieve this by learning
an underlying Bernoulli distribution to sample binary pruning masks, where we
decouple the Bernoulli parameters from the LLM loss, thus facilitating an
efficient optimization via a policy gradient estimator without
back-propagation. As a result, our method is able to 1) operate at structural
granularities of channels, heads, and layers, 2) support global and
heterogeneous pruning (i.e., our method automatically determines different
redundancy for different layers), and 3) optionally use a metric-based method
as initialization (of our Bernoulli distributions). Extensive experiments on
LLaMA, LLaMA-2, and Vicuna using the C4 and WikiText2 datasets demonstrate that
our method runs in 2.7 hours with around 35GB of memory for the 13B models on
a single A100 GPU, and that our pruned models outperform state-of-the-art
methods in terms of perplexity. Code will be released.
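The idea of learning Bernoulli mask distributions from forward passes alone can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the "model" is a toy linear regressor, the names `masked_loss`, `policy_gradient`, `exact_gradient`, and `learn_keep_probs` are hypothetical, the Bernoulli parameters are stored as logits, and a leave-one-out baseline stands in for whatever variance reduction the method actually uses. The point is only that the gradient with respect to the mask parameters needs loss values, not back-propagation through the model.

```python
import itertools
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def masked_loss(mask, weights, x, y):
    # Forward pass only: squared error of a toy linear model with masked channels.
    pred = x @ (weights * mask)
    return float(np.mean((pred - y) ** 2))

def policy_gradient(logits, loss_fn, n_samples, rng):
    """Score-function (REINFORCE) estimate of d E[loss] / d logits."""
    probs = sigmoid(logits)
    # Sample binary masks from independent Bernoulli distributions.
    masks = (rng.random((n_samples, logits.size)) < probs).astype(float)
    losses = np.array([loss_fn(m) for m in masks])
    # Leave-one-out baseline (an assumption): mean loss of the other samples.
    baselines = (losses.sum() - losses) / (n_samples - 1)
    # For m ~ Bernoulli(sigmoid(logit)): d log p(m) / d logit = m - prob.
    scores = masks - probs
    return ((losses - baselines)[:, None] * scores).mean(axis=0)

def exact_gradient(logits, loss_fn):
    """Exact gradient by enumerating all 2^d masks; feasible only for tiny d."""
    probs = sigmoid(logits)
    grad = np.zeros_like(logits)
    for bits in itertools.product([0.0, 1.0], repeat=logits.size):
        m = np.array(bits)
        p_m = np.prod(np.where(m == 1.0, probs, 1.0 - probs))
        grad += p_m * loss_fn(m) * (m - probs)
    return grad

def learn_keep_probs(loss_fn, dim, steps=200, n_samples=16, lr=0.2, seed=0):
    """Gradient descent on the mask logits using only loss evaluations."""
    rng = np.random.default_rng(seed)
    logits = np.zeros(dim)
    for _ in range(steps):
        logits -= lr * policy_gradient(logits, loss_fn, n_samples, rng)
    return sigmoid(logits)
```

In this toy setting, `loss_fn` could be `masked_loss` plus a small per-channel penalty (again an assumption, used here only to give the optimizer a reason to prune). `exact_gradient` is included solely so that the sampled estimate can be compared against the true gradient on a problem small enough to enumerate.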

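This optimization-based structural pruning method learns pruning masks in a probabilistic space and optimizes them efficiently using only forward passes and a policy gradient estimator, pruning large language models while surpassing existing methods in both cost and effectiveness.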

An Optimization-Based Structural Pruning Method for Large Language Models

Optimization-based Structural Pruning for Large Language Models without Back-Propagation

We derive an unbiased estimator for expectations over discrete random
variables based on sampling without replacement, which reduces variance as it
avoids duplicate samples. We show that our estimator can be derived as the
Rao-Blackwellization of three different estimators. Combining our estimator
with REINFORCE, we obtain a policy gradient estimator and we reduce its
variance using a built-in control variate which is obtained without additional
model evaluations. The resulting estimator is closely related to other gradient
estimators. Experiments with a toy problem, a categorical Variational
Auto-Encoder and a structured prediction problem show that our estimator is the
only estimator that is consistently among the best estimators in both high and
low entropy settings.
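To make the notion of an unbiased estimator built from samples drawn without replacement concrete, here is a minimal sketch. It is not the estimator derived in the abstract above. It uses a simple sum-and-sample construction chosen for illustration: the first k-1 distinct draws are summed exactly, and the final draw stands in for all remaining probability mass. All function names are hypothetical. Because the domain is tiny, the expectation of the estimator can be computed exactly by enumeration rather than approximated by simulation.

```python
import itertools
import numpy as np

def sample_without_replacement(probs, k, rng):
    """Draw k distinct indices sequentially, renormalizing after each draw."""
    remaining = np.array(probs, dtype=float)
    chosen = []
    for _ in range(k):
        idx = rng.choice(len(remaining), p=remaining / remaining.sum())
        chosen.append(int(idx))
        remaining[idx] = 0.0  # an index can never be drawn twice
    return chosen

def sum_and_sample(probs, f_values, ordered_sample):
    """Estimate E[f]: exact sum over the first k-1 draws, last draw covers the rest."""
    *head, last = ordered_sample
    head_mass = sum(probs[i] for i in head)
    head_sum = sum(probs[i] * f_values[i] for i in head)
    return head_sum + (1.0 - head_mass) * f_values[last]

def sequence_probability(probs, ordered_sample):
    """Probability of drawing exactly this ordered sequence without replacement."""
    remaining = 1.0
    prob = 1.0
    for i in ordered_sample:
        prob *= probs[i] / remaining
        remaining -= probs[i]
    return prob

def expected_estimate(probs, f_values, k):
    """Exact expectation of the estimator over every ordered sample of size k."""
    return sum(
        sequence_probability(probs, s) * sum_and_sample(probs, f_values, s)
        for s in itertools.permutations(range(len(probs)), k)
    )
```

Comparing `expected_estimate(probs, f_values, k)` with the directly computed mean `sum(p * f)` for several values of k is a quick way to confirm unbiasedness on a small example. No duplicate evaluations of `f` ever occur, which is the property the abstract credits for the variance reduction.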

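This paper proposes an unbiased estimator for expectations over discrete random variables based on sampling without replacement. Combining it with REINFORCE yields a policy gradient estimator with a built-in control variate, which performs well across several tasks.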

Estimating Gradients of Discrete Random Variables via Sampling without Replacement

Estimating Gradients for Discrete Random Variables by Sampling without Replacement

Sequence generation models are commonly refined with reinforcement learning
over user-defined metrics. However, high gradient variance hinders the
practical use of this method. To stabilize it, we adapt a policy gradient
estimator to the contextual generation of categorical sequences; the estimator
evaluates a set of correlated Monte Carlo (MC) rollouts for variance control.
Due to the correlation, the number of unique rollouts is random and adaptive to
model uncertainty; those rollouts naturally become baselines for each other,
and hence are combined to effectively reduce gradient variance. We also
demonstrate the use of correlated MC rollouts for binary-tree softmax models,
which reduce the high generation cost in large vocabulary scenarios by
decomposing each categorical action into a sequence of binary actions. We
evaluate our methods on both neural program synthesis and image captioning. The
proposed methods yield lower gradient variance and consistent improvement over
related baselines.
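The decomposition of a categorical action into a sequence of binary actions can be sketched as follows. This is an illustrative assumption and not the paper's model: the vocabulary has size 2^depth, the tree is a complete binary tree stored in heap order, each internal node holds a single fixed logit (a real model would compute these from context), and the names `token_log_prob` and `sample_token` are hypothetical. Choosing a token then costs `depth` binary decisions instead of one softmax over the full vocabulary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def token_log_prob(token, node_logits, depth):
    """Log-probability of a token as a sum over binary decisions on its root-to-leaf path."""
    node = 1  # heap-style root index; the children of node n are 2n and 2n + 1
    log_p = 0.0
    for level in reversed(range(depth)):
        bit = (token >> level) & 1  # most significant bit first
        p_right = sigmoid(node_logits[node])
        log_p += np.log(p_right if bit else 1.0 - p_right)
        node = 2 * node + bit
    return log_p

def sample_token(node_logits, depth, rng):
    """Sample a token by making one binary decision per tree level."""
    node = 1
    for _ in range(depth):
        bit = int(rng.random() < sigmoid(node_logits[node]))
        node = 2 * node + bit
    return node - (1 << depth)  # convert the leaf's heap index into a token id
```

Here `node_logits` is an array of length 2^depth whose index 0 is unused and whose indices 1 through 2^depth - 1 hold the internal-node logits. Summing `exp(token_log_prob(...))` over all tokens gives 1, so the binary decisions define a valid categorical distribution.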

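This work proposes a policy gradient estimator for categorical sequence generation based on correlated Monte Carlo rollouts. By evaluating a set of correlated rollouts to control variance, the method effectively reduces gradient variance, and it can also lower the generation cost in large-vocabulary scenarios.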