The efficient compression of large language models (LLMs) is becoming
increasingly popular. However, recovering the accuracy of compressed LLMs is
still a major challenge. Structural pruning with standard Low-Rank Adaptation
(LoRA) is a common technique in current LLM compression. Structural pruning
modifies the model architecture unevenly, so standard LoRA with a fixed rank
yields suboptimal performance on various downstream tasks. To
address this problem, we introduce RankAdaptor, an efficient fine-tuning method
with hierarchical dynamic rank scheduling for pruned LLMs. An end-to-end
automatic optimization flow is developed that utilizes a lightweight
performance model to determine the different ranks during fine-tuning.
Comprehensive experiments on popular benchmarks show that RankAdaptor
consistently outperforms standard LoRA with structural pruning over different
pruning settings. Without increasing the trainable parameters, RankAdaptor
further narrows the accuracy gap between the recovered pruned model and the
original model compared to standard LoRA.
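
The parameter-budget claim can be illustrated with a minimal sketch. It assumes a LoRA adapter on a layer of shape (d_in, d_out) adds rank * (d_in + d_out) trainable parameters; the layer shapes and per-layer ranks below are hypothetical, hand-picked values. This is not RankAdaptor's performance-model-driven rank selection, only the bookkeeping behind comparing a fixed rank with layer-specific ranks.

```python
# Hypothetical (d_in, d_out) shapes of three layers after uneven structural pruning.
layer_shapes = [(4096, 4096), (4096, 3072), (4096, 2048)]


def lora_params(shape, rank):
    """Trainable parameters of one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    d_in, d_out = shape
    return rank * (d_in + d_out)


def total_lora_params(shapes, ranks):
    """Sum adapter parameters over all layers, using one rank per layer."""
    if len(shapes) != len(ranks):
        raise ValueError("need exactly one rank per layer")
    return sum(lora_params(s, r) for s, r in zip(shapes, ranks))


fixed_ranks = [8, 8, 8]        # standard LoRA: same rank everywhere
per_layer_ranks = [4, 8, 12]   # hypothetical layer-specific ranks (illustration only)

fixed_total = total_lora_params(layer_shapes, fixed_ranks)
dynamic_total = total_lora_params(layer_shapes, per_layer_ranks)
print(fixed_total, dynamic_total)
```

In this toy setting the layer-specific assignment stays within the fixed-rank budget while giving the most heavily pruned layer a larger rank; which assignment actually recovers more accuracy is what the paper's performance model is meant to decide.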

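RankAdaptor's hierarchical dynamic rank scheduling efficiently fine-tunes pruned large language models (LLMs) and, without increasing the number of trainable parameters, further narrows the gap between the recovered accuracy of the pruned model and that of the original model.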

RankAdaptor: Hierarchical Dynamic Low-Rank Adaptation for Structural Pruned LLMs

Meta's LLaMA family has become one of the most powerful open-source Large
Language Model (LLM) series. Notably, LLaMA3 models have recently been released
and achieve impressive performance across various tasks with super-large-scale
pre-training on over 15T tokens of data. Given the wide application of low-bit
quantization for LLMs in resource-limited scenarios, we explore LLaMA3's
capabilities when quantized to low bit-width. This exploration holds the
potential to unveil new insights and challenges for low-bit quantization of
LLaMA3 and other forthcoming LLMs, especially in addressing the performance
degradation that arises in LLM compression. Specifically, we evaluate 10
existing post-training quantization and LoRA-finetuning methods on LLaMA3 at
1-8 bits and on diverse datasets to comprehensively reveal LLaMA3's
low-bit quantization performance. Our experimental results indicate that LLaMA3
still suffers non-negligible degradation in these scenarios, especially at
ultra-low bit-widths. This highlights the significant performance gap under low
bit-width that needs to be bridged in future developments. We expect that this
empirical study will prove valuable in advancing future models, pushing
LLMs to lower bit-widths with higher accuracy so that they become practical. Our
project is released at this https URL and the quantized LLaMA3 models are
released at this https URL
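
To make the bit-width tradeoff concrete, here is a minimal sketch of symmetric per-tensor round-to-nearest quantization. It is a generic baseline, not one of the 10 methods evaluated in the study, and the weight values are made up. The sketch needs at least 2 bits for a signed grid, so it does not cover the 1-bit setting mentioned above.

```python
def quantize_dequantize(weights, bits):
    """Symmetric per-tensor round-to-nearest quantization, then dequantization."""
    if bits < 2:
        raise ValueError("this sketch needs at least 2 bits for a signed grid")
    qmax = 2 ** (bits - 1) - 1            # largest representable integer level
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights)              # all-zero tensor: nothing to quantize
    scale = max_abs / qmax                # real value of one integer step
    restored = []
    for w in weights:
        q = round(w / scale)              # nearest integer level
        q = max(-qmax, min(qmax, q))      # clamp to the signed grid
        restored.append(q * scale)        # map back to a real value
    return restored


def mean_squared_error(a, b):
    """Average squared difference between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Made-up weights, for illustration only.
weights = [0.9, -0.45, 0.12, 0.03, -0.77, 0.31, -0.08, 0.6]
errors = {
    bits: mean_squared_error(weights, quantize_dequantize(weights, bits))
    for bits in (8, 4, 2)
}
print(errors)
```

On this toy tensor the reconstruction error grows as the bit-width shrinks, which is the kind of degradation the study measures at model scale.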

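LLaMA3 suffers marked performance degradation under low-bit quantization; the performance gap at low bit-widths needs to be bridged in future work, and this empirical study is valuable for advancing future models.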

How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study

The advancements in Large Language Models (LLMs) have been hindered by their
substantial sizes, which necessitate LLM compression methods for practical
deployment. Singular Value Decomposition (SVD) offers a promising solution for
LLM compression. However, state-of-the-art SVD-based LLM compression methods
have two key limitations: truncating smaller singular values may lead to higher
compression loss, and the remaining model parameters are not updated
after SVD truncation. In this work, we propose SVD-LLM, a new SVD-based LLM
compression method that addresses the limitations of existing methods. SVD-LLM
incorporates a truncation-aware data whitening strategy to ensure a direct
mapping between singular values and compression loss. Moreover, SVD-LLM adopts
a layer-wise closed-form model parameter update strategy to compensate for
accuracy degradation caused by SVD truncation. We evaluate SVD-LLM on a total
of 11 datasets and seven models from three different LLM families at four
different scales. Our results demonstrate the superiority of SVD-LLM over
state-of-the-art methods, especially at high model compression ratios. The source code
is available at this https URL
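
The idea of a direct mapping between singular values and compression loss can be sketched in NumPy. The sketch assumes the whitening matrix S is the Cholesky factor of X Xᵀ for calibration activations X; the function name, shapes, and random data are hypothetical, and the closed-form parameter update from the abstract is not included. Under that assumption, the output error of a rank-k truncation equals the root of the sum of the squared discarded singular values.

```python
import numpy as np


def whitened_truncated_svd(W, X, k):
    """Rank-k approximation of W guided by calibration activations X.

    W: (d_out, d_in) weight matrix. X: (d_in, n) calibration activations.
    Returns the approximated weight and all singular values of W @ S.
    """
    # Whitening matrix S with S @ S.T == X @ X.T (assumption of this sketch).
    S = np.linalg.cholesky(X @ X.T)
    U, sigma, Vt = np.linalg.svd(W @ S, full_matrices=False)
    # Keep the k largest singular values, then undo the whitening.
    W_k = (U[:, :k] * sigma[:k]) @ Vt[:k, :] @ np.linalg.inv(S)
    return W_k, sigma


rng = np.random.default_rng(0)
W = rng.standard_normal((6, 8))
X = rng.standard_normal((8, 64))   # n > d_in, so X @ X.T is positive definite
k = 3

W_k, sigma = whitened_truncated_svd(W, X, k)

loss = np.linalg.norm(W @ X - W_k @ X)        # Frobenius norm of the output error
predicted = np.sqrt(np.sum(sigma[k:] ** 2))   # from the discarded singular values
print(loss, predicted)
```

A real compression would store the two low-rank factors rather than the dense W_k; the dense form is used here only to check the loss identity.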

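Proposes SVD-LLM, a new SVD-based compression method for large language models that addresses the limitations of existing methods and shows superior performance at high model compression ratios.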

SVD-LLM: Truncation-aware Singular Value Decomposition for Large Language Model Compression

Compressing large language models (LLMs), which often consist of billions of
parameters, provides faster inference and smaller memory footprints, and enables
local deployment. Two standard compression techniques are pruning and
quantization, with the former eliminating redundant connections in model layers
and the latter representing model parameters with fewer bits. The key tradeoff
is between the degree of compression and the impact on the quality of the
compressed model. Existing research on LLM compression primarily focuses on
performance in terms of general metrics like perplexity or downstream task
accuracy. More fine-grained metrics, such as those measuring parametric
knowledge, remain significantly underexplored. To help bridge this gap, we
present a comprehensive analysis across multiple model families (ENCODER,
ENCODER-DECODER, and DECODER) using the LAMA and LM-HARNESS benchmarks in order
to systematically quantify the effect of commonly employed compression
techniques on model performance. A particular focus is on tradeoffs involving
parametric knowledge, with the goal of providing practitioners with practical
insights to help make informed decisions on compression. We release our
codebase to enable further research.
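
As a small illustration of the pruning side of the tradeoff, here is a sketch of unstructured magnitude pruning. Using weight magnitude as the redundancy criterion is an assumption of this sketch, not a method attributed to the paper, and the weight values are made up.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights (unstructured pruning)."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = int(len(weights) * sparsity)   # number of weights to remove
    # Indices ordered from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


# Made-up weights, for illustration only.
weights = [0.9, -0.45, 0.12, 0.03, -0.77, 0.31, -0.08, 0.6]
pruned = magnitude_prune(weights, 0.5)
print(pruned)
```

Raising the sparsity removes more connections and shrinks the model further, at the cost of discarding weights that may carry parametric knowledge; that is the tradeoff the analysis quantifies.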

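Compressing large language models (LLMs), which contain billions of parameters, provides faster inference and a smaller memory footprint, and enables local deployment. We present a comprehensive analysis across multiple model families (ENCODER, ENCODER-DECODER, and DECODER) using the LAMA and LM-HARNESS benchmarks to systematically quantify the effect of commonly used compression techniques on model performance, with a particular focus on tradeoffs involving parametric knowledge, aiming to give practitioners practical insights for making informed compression decisions.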