Low Rank Adaptation (LoRA) has gained massive attention in recent
generative AI research. One of the main advantages of LoRA is its ability to be
fused with pretrained models, adding no overhead during inference. However, from
a mobile deployment standpoint, we can either avoid inference overhead in the
fused mode but lose the ability to switch adapters rapidly, or suffer
significant (up to 30% higher) inference latency while enabling rapid switching
in the unfused mode. LoRA also exhibits concept loss when multiple adapters are
used concurrently. In this paper, we propose Sparse High Rank Adapters (SHiRA),
a new paradigm which incurs no inference overhead, enables rapid switching, and
significantly reduces concept loss. Specifically, SHiRA can be trained by
directly tuning only 1-2% of the base model weights while leaving others
unchanged. This results in a highly sparse adapter which can be switched
directly in the fused mode. We further provide theoretical and empirical
insights on how high sparsity in SHiRA can aid multi-adapter fusion by reducing
concept loss. Our extensive experiments on LVMs and LLMs demonstrate that
finetuning only a small fraction of the parameters in the base model is
sufficient for many tasks while enabling both rapid switching and multi-adapter
fusion. Finally, we provide a latency- and memory-efficient SHiRA
implementation based on the Parameter-Efficient Finetuning (PEFT) library. This
implementation trains at nearly the same speed as LoRA while consuming less
peak GPU memory, thus making SHiRA easy to adopt for practical use cases.

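This paper proposes a new paradigm based on Sparse High Rank Adapters (SHiRA): a highly sparse adapter is trained by directly tuning only 1-2% of the base model weights, which in the fused mode incurs no inference overhead, enables rapid switching, and significantly reduces concept loss. Extensive experiments on LVMs and LLMs show that finetuning only a small fraction of the base model's parameters is sufficient for many tasks while enabling both rapid switching and multi-adapter fusion.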
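
The idea of tuning only a small subset of weights and swapping the resulting sparse adapter directly into the fused weights can be sketched in a few lines. This is a toy illustration in plain Python, not the authors' PEFT-based implementation: the random mask, the dummy gradient, the 2% ratio, and the helper names `make_mask`, `masked_sgd_step`, `extract_adapter`, and `apply_adapter` are all assumptions made for the example.

```python
import random

def make_mask(num_weights, fraction, seed=0):
    """Pick a random subset of weight indices to train (assumed mask choice)."""
    rng = random.Random(seed)
    k = max(1, int(num_weights * fraction))
    return sorted(rng.sample(range(num_weights), k))

def masked_sgd_step(weights, grads, mask, lr=0.1):
    """Update only the masked weights; all other weights stay frozen."""
    for i in mask:
        weights[i] -= lr * grads[i]

def extract_adapter(tuned, mask):
    """A sparse adapter stores just the tuned values at the masked indices."""
    return {i: tuned[i] for i in mask}

def apply_adapter(weights, adapter):
    """Overwrite the adapter's entries in place; return old values for undo."""
    saved = {i: weights[i] for i in adapter}
    for i, value in adapter.items():
        weights[i] = value
    return saved

base = [0.5] * 1000                  # stand-in for flattened base weights
mask = make_mask(len(base), 0.02)    # train about 2% of the weights

tuned = list(base)
grads = [1.0] * len(base)            # dummy gradient, for illustration only
masked_sgd_step(tuned, grads, mask)

adapter = extract_adapter(tuned, mask)
deployed = list(base)
saved = apply_adapter(deployed, adapter)    # switch the adapter on
changed = sum(1 for a, b in zip(deployed, base) if a != b)
apply_adapter(deployed, saved)              # switch back to the base model
```

Because the adapter touches only the masked entries, switching costs a number of writes proportional to the adapter size rather than to the model size, and inference runs on ordinary fused weights with no extra branch.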

Sparse High Rank Adapters

Non-parametric neural language models (NLMs) learn predictive distributions
of text utilizing an external datastore, which allows them to learn through
explicitly memorizing the training datapoints. While effective, these models
often require retrieval from a large datastore at test time, significantly
increasing the inference overhead and thus limiting the deployment of
non-parametric NLMs in practical applications. In this paper, we take the
recently proposed $k$-nearest neighbors language model (Khandelwal et al.,
2020) as an example, exploring methods to improve its efficiency along various
dimensions. Experiments on the standard WikiText-103 benchmark and
domain-adaptation datasets show that our methods are able to achieve up to a 6x
inference speed-up while retaining comparable performance. The
empirical analysis we present may provide guidelines for future research
seeking to develop or deploy more efficient non-parametric NLMs.

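This paper explores how to improve the efficiency of non-parametric neural language models. Experiments show that the proposed methods achieve up to a 6x inference speed-up while retaining comparable performance, providing guidelines for future work on developing or deploying more efficient non-parametric NLMs.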
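
For context, the baseline $k$-nearest neighbors language model of Khandelwal et al. (2020) interpolates the parametric LM distribution with a distribution built from retrieved datastore entries. The sketch below is a toy, brute-force version of that baseline in plain Python, not the efficiency methods studied in this paper; the tiny datastore, the squared-L2 distance, and the names `knn_distribution` and `interpolate` are assumptions for illustration.

```python
import math

def knn_distribution(query, datastore, k):
    """Softmax over negative squared L2 distances of the k nearest keys."""
    scored = sorted(
        (sum((q - x) ** 2 for q, x in zip(query, key)), token)
        for key, token in datastore
    )[:k]
    weights = [(math.exp(-dist), token) for dist, token in scored]
    total = sum(w for w, _ in weights)
    probs = {}
    for w, token in weights:
        # Neighbors that share a target token pool their probability mass.
        probs[token] = probs.get(token, 0.0) + w / total
    return probs

def interpolate(p_lm, p_knn, lam):
    """Mix the two distributions: lam * p_knn + (1 - lam) * p_lm."""
    vocab = set(p_lm) | set(p_knn)
    return {
        t: lam * p_knn.get(t, 0.0) + (1 - lam) * p_lm.get(t, 0.0)
        for t in vocab
    }

# Hypothetical datastore of (context vector, next token) pairs.
datastore = [
    ((0.0, 0.0), "cat"),
    ((0.1, 0.0), "cat"),
    ((1.0, 1.0), "dog"),
    ((5.0, 5.0), "dog"),
]
p_lm = {"cat": 0.3, "dog": 0.7}      # made-up parametric LM output
p_knn = knn_distribution((0.0, 0.0), datastore, k=2)
p_final = interpolate(p_lm, p_knn, lam=0.25)
```

The `sorted` call scans every datastore entry for each prediction, which is the test-time retrieval cost the abstract refers to; real datastores hold far more entries than this toy list.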