Large language models (LLMs) show excellent performance on difficult tasks,
but they often require massive memory and computational resources. How to
reduce the parameter scale of LLMs has become a research hotspot. In this study,
we make an important observation: the multi-head self-attention (MHA)
sub-layer of the Transformer exhibits a noticeable low-rank structure, while the
feed-forward network (FFN) sub-layer does not. Based on this observation, we
design a mixed compression model that combines Low-Rank matrix
approximation And structured Pruning (LoRAP). For the MHA sub-layer, we propose
an input activation weighted singular value decomposition method to strengthen
the low-rank characteristic. Furthermore, we discover that the weight matrices
in the MHA sub-layer have different low-rank degrees. We therefore devise a novel
parameter allocation scheme based on the discrepancy in low-rank degrees.
For the FFN sub-layer, we propose a gradient-free structured channel pruning
method. During pruning, we find that the least
important 1% of parameters actually play a vital role in model performance.
Extensive evaluations on zero-shot perplexity and zero-shot task classification
indicate that our proposal outperforms previous structured compression
methods under multiple compression ratios.
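
To make the idea of an input activation weighted singular value decomposition concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the weighting, so the sketch assumes each input channel of a weight matrix is scaled by the L2 norm of its calibration activations before the SVD. The function name `activation_weighted_low_rank` and the arguments `W`, `X`, `rank`, and `eps` are hypothetical.

```python
import numpy as np

def activation_weighted_low_rank(W, X, rank, eps=1e-8):
    """Factor W (d_out, d_in) into A (d_out, rank) @ B (rank, d_in).

    X holds calibration activations of shape (n_samples, d_in).
    The per-channel weighting below is an assumption, not taken from the source.
    """
    # One scale per input channel: the L2 norm of that channel's activations.
    s = np.linalg.norm(X, axis=0) + eps
    # Scale the columns of W so the SVD favours channels that see large inputs.
    U, sigma, Vt = np.linalg.svd(W * s, full_matrices=False)
    # Keep the top-`rank` components and undo the scaling on the right factor.
    A = U[:, :rank] * sigma[:rank]
    B = Vt[:rank, :] / s
    return A, B
```

Under this assumption, `A @ B` is the best rank-`rank` approximation of `W` when the error in each input channel is weighted by that channel's scale, whereas a plain truncated SVD treats all input channels equally.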

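This study proposes LoRAP, a mixed compression model. It strengthens the low-rank characteristic of the Multi-Head Self-Attention sub-layer of the Transformer through an input activation weighted singular value decomposition method and a parameter allocation scheme based on the discrepancy in low-rank degrees, and it introduces a gradient-free structured channel pruning method for the Feed-Forward Network sub-layer. Experiments show that our proposal outperforms previous structured compression methods under multiple compression ratios.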
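
As an illustration of gradient-free structured channel pruning for an FFN sub-layer, here is a minimal sketch under stated assumptions. The text does not give the importance criterion, so the sketch scores each hidden channel by the norm of its calibration activations multiplied by the norm of its outgoing weights, and it assumes a plain two-matrix FFN with ReLU instead of a gated variant. The function name `prune_ffn_channels` and all of its parameters are hypothetical.

```python
import numpy as np

def prune_ffn_channels(W_up, W_down, X, keep_ratio):
    """Drop whole hidden channels of an FFN without using gradients.

    W_up: (d_ff, d_model), W_down: (d_model, d_ff), X: (n_samples, d_model).
    The importance score is an assumed criterion, not the source's.
    """
    # Hidden activations on calibration data (ReLU assumed for simplicity).
    H = np.maximum(X @ W_up.T, 0.0)
    # Score per hidden channel: activation norm times outgoing weight norm.
    score = np.linalg.norm(H, axis=0) * np.linalg.norm(W_down, axis=0)
    n_keep = max(1, int(round(keep_ratio * W_up.shape[0])))
    # Indices of the highest-scoring channels, returned in ascending order.
    keep = np.sort(np.argsort(score)[-n_keep:])
    # Removing a channel deletes one row of W_up and one column of W_down.
    return W_up[keep, :], W_down[:, keep], keep
```

Because entire channels are removed, the pruned matrices remain dense and smaller, so no sparse kernels are needed at inference time. Only forward activations are required to compute the scores.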