Pretrained Language Models (PLMs) have become the de facto starting point for
fine-tuning on downstream tasks. However, as model sizes continue to increase,
traditional fine-tuning of all parameters becomes challenging. To address this,
parameter-efficient fine-tuning (PEFT) methods have gained popularity as a
means to adapt PLMs effectively. In parallel, recent studies have revealed the
presence of activation sparsity within the intermediate outputs of the
multilayer perceptron (MLP) blocks in transformers. Low activation density
enables efficient model inference on sparsity-aware hardware. Building upon
this insight, in this work, we propose a novel density loss that encourages
higher activation sparsity (equivalently, lower activation density) in the
pre-trained models. We demonstrate the effectiveness of our approach by
utilizing mainstream PEFT techniques, including QLoRA, LoRA, Adapter, and
Prompt/Prefix Tuning, to facilitate efficient model adaptation across diverse
downstream tasks. Experiments show that our proposed method, DEFT
(Density-Efficient Fine-Tuning), consistently reduces activation density
relative to PEFT: by up to $\boldsymbol{50.72\%}$ on RoBERTa$_\mathrm{Large}$
on GLUE, and by $\boldsymbol{53.19\%}$ (encoder density) and
$\boldsymbol{90.60\%}$ (decoder density) on Flan-T5$_\mathrm{XXL}$
($\boldsymbol{11B}$) on QA (SQuAD), while maintaining competitive performance
on downstream tasks. We also show that DEFT is complementary to quantized and
pruned models.
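
As a rough illustration of what "activation density" measures, the sketch below computes the fraction of non-zero entries in a toy MLP intermediate output and adds a smooth penalty on that fraction to a task loss. The function names, the tanh-based surrogate, and the `alpha` and `beta` weights are assumptions made for this example; the abstract does not specify the exact form of the density loss.

```python
import math


def activation_density(activations, eps=0.0):
    """Exact density: fraction of entries whose magnitude exceeds eps."""
    flat = [a for row in activations for a in row]
    return sum(1 for a in flat if abs(a) > eps) / len(flat)


def smooth_density(activations, beta=10.0):
    """Hypothetical smooth surrogate: mean of tanh(beta * |a|).

    Each term is 0 for a zero activation and approaches 1 for a large one,
    so the mean approaches the exact density as beta grows.
    """
    flat = [a for row in activations for a in row]
    return sum(math.tanh(beta * abs(a)) for a in flat) / len(flat)


def total_loss(task_loss, activations, alpha=1.0, beta=10.0):
    """Task loss plus a weighted density penalty (alpha is an assumed knob)."""
    return task_loss + alpha * smooth_density(activations, beta)


# Toy post-ReLU activations for two tokens and four hidden units.
relu_out = [[0.0, 1.5, 0.0, 0.2],
            [0.0, 0.0, 3.0, 0.0]]

print(activation_density(relu_out))
print(smooth_density(relu_out))
print(total_loss(0.5, relu_out, alpha=0.1))
```

The exact count is not differentiable, which is why a training objective would need some smooth stand-in such as the one assumed here.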

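This work proposes a novel density loss that encourages higher activation sparsity in pre-trained models, enabling effective model adaptation. Experiments show that, without degrading downstream task performance, DEFT reduces activation density by up to 50.72% on RoBERTa_Large, and by 53.19% (encoder density) and 90.60% (decoder density) on Flan-T5_XXL (11B), compared to PEFT on the GLUE and QA (SQuAD) benchmarks. We also show that DEFT is complementary to quantized and pruned models.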

From PEFT to DEFT: Parameter Efficient Finetuning for Reducing Activation Density in Transformers

The advent of high-capacity pre-trained models has revolutionized
problem-solving in computer vision, shifting the focus from training
task-specific models to adapting pre-trained models. Consequently, adapting
large pre-trained models to downstream tasks efficiently has become a
prominent research area. Existing solutions primarily concentrate
on designing lightweight adapters and their interaction with pre-trained
models, with the goal of minimizing the number of parameters requiring updates.
In this study, we propose a novel Adapter Re-Composing (ARC) strategy that
addresses efficient pre-trained model adaptation from a fresh perspective. Our
approach considers the reusability of adaptation parameters and introduces a
parameter-sharing scheme. Specifically, we leverage symmetric
down-/up-projections to construct bottleneck operations, which are shared
across layers. By learning low-dimensional re-scaling coefficients, we can
effectively re-compose layer-adaptive adapters. This parameter-sharing strategy
in adapter design allows us to significantly reduce the number of new
parameters while maintaining satisfactory performance, thereby offering a
promising approach to compress the adaptation cost. We conduct experiments on
24 downstream image classification tasks using various Vision Transformer
variants to evaluate our method. The results demonstrate that our approach
achieves compelling transfer learning performance with a reduced parameter
count. Our code is available at
https://github.com/DavidYanAnDe/ARC.
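
To make the parameter-sharing idea concrete, the following minimal sketch shares one down-projection across all layers, uses its transpose as the up-projection, and gives each layer only a vector of re-scaling coefficients. The class and function names, the residual connection, the zero initialization of the coefficients, and the omission of biases and nonlinearities are all assumptions for illustration, not the authors' implementation.

```python
import random


def matvec(mat, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]


def transpose(mat):
    return [list(col) for col in zip(*mat)]


class SharedBottleneck:
    """Hypothetical shared adapter: one projection, per-layer coefficients."""

    def __init__(self, dim, rank, num_layers, seed=0):
        rng = random.Random(seed)
        # Shared down-projection of shape (rank, dim); the up-projection is
        # its transpose, so no separate up matrix is stored.
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                     for _ in range(rank)]
        # One low-dimensional coefficient vector per layer (assumed zero
        # init, so every adapter starts as the identity mapping).
        self.scales = [[0.0] * rank for _ in range(num_layers)]

    def forward(self, x, layer):
        z = matvec(self.down, x)                               # dim -> rank
        z = [c * zi for c, zi in zip(self.scales[layer], z)]   # re-scale
        delta = matvec(transpose(self.down), z)                # rank -> dim
        return [xi + di for xi, di in zip(x, delta)]           # residual

    def num_new_parameters(self):
        shared = len(self.down) * len(self.down[0])
        return shared + sum(len(s) for s in self.scales)


def per_layer_adapter_parameters(dim, rank, num_layers):
    """Baseline: separate down and up matrices in every layer, no biases."""
    return 2 * dim * rank * num_layers


# Illustrative sizes only; these are not settings reported in the paper.
adapter = SharedBottleneck(dim=768, rank=64, num_layers=12)
print(adapter.num_new_parameters())
print(per_layer_adapter_parameters(768, 64, 12))
```

Under these assumptions the shared variant stores `dim * rank + rank * num_layers` new parameters, against `2 * dim * rank * num_layers` for independent per-layer adapters.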

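The advent of high-capacity pre-trained models has changed how problems are solved in computer vision, shifting the focus from training task-specific models to adapting pre-trained models and making the efficient adaptation of large pre-trained models to downstream tasks an important research area. This work proposes a novel Adapter Re-Composing (ARC) strategy that addresses efficient pre-trained model adaptation from a new perspective. It considers the reusability of adaptation parameters and introduces a parameter-sharing scheme: symmetric down-/up-projections form bottleneck operations that are shared across layers, and learned low-dimensional re-scaling coefficients re-compose layer-adaptive adapters. This parameter-sharing strategy significantly reduces the number of new parameters while maintaining satisfactory performance, offering a promising way to compress the adaptation cost. Experiments on 24 downstream image classification tasks with various Vision Transformer variants show that the method achieves compelling transfer learning performance with a reduced parameter count.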