LoRA and its variants have become popular parameter-efficient fine-tuning
(PEFT) methods due to their ability to avoid excessive computational costs.
However, an accuracy gap often exists between PEFT methods and full fine-tuning
(FT), and this gap has yet to be systematically studied. In this work, we
introduce a method for selecting sparse sub-matrices that aims to minimize the
performance gap between PEFT and full fine-tuning while also reducing both the
computational cost and the memory cost of fine-tuning. Our Sparse Matrix Tuning
(SMT) method begins by identifying the most significant sub-matrices in the
gradient update and then updates only these blocks during fine-tuning. In our
experiments, we demonstrate that SMT consistently surpasses other PEFT baselines
(e.g., LoRA and DoRA) in fine-tuning popular large language models such as LLaMA
across a broad spectrum of tasks, while reducing the GPU memory footprint by
67% compared to FT. We also examine how the performance of LoRA and DoRA tends
to plateau and decline as the number of trainable parameters increases; in
contrast, our SMT method does not suffer from such an issue.
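
The block-selection idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: the abstract does not say how sub-matrices are scored or sized, so the mean-absolute-gradient score, the block size, the number of selected blocks `k`, and the plain SGD step are all hypothetical choices, as are the function names.

```python
def block_scores(grad, block):
    """Mean absolute gradient of each (block x block) sub-matrix (assumed score)."""
    rows, cols = len(grad), len(grad[0])
    scores = {}
    for bi in range(0, rows, block):
        for bj in range(0, cols, block):
            cells = [abs(grad[i][j])
                     for i in range(bi, min(bi + block, rows))
                     for j in range(bj, min(bj + block, cols))]
            scores[(bi, bj)] = sum(cells) / len(cells)
    return scores


def select_blocks(grad, block, k):
    """Return the top-left corners of the k highest-scoring blocks."""
    scores = block_scores(grad, block)
    return sorted(scores, key=scores.get, reverse=True)[:k]


def sparse_update(weight, grad, selected, block, lr):
    """Apply an SGD step only inside the selected blocks (in place)."""
    rows, cols = len(weight), len(weight[0])
    for bi, bj in selected:
        for i in range(bi, min(bi + block, rows)):
            for j in range(bj, min(bj + block, cols)):
                weight[i][j] -= lr * grad[i][j]
    return weight


# Toy 4x4 weight matrix whose gradient is concentrated in the bottom-right block.
weight = [[1.0] * 4 for _ in range(4)]
grad = [[0.01] * 4 for _ in range(4)]
grad[2][2] = grad[2][3] = grad[3][2] = grad[3][3] = 1.0

selected = select_blocks(grad, block=2, k=1)
sparse_update(weight, grad, selected, block=2, lr=0.1)
```

In this toy case only the bottom-right 2x2 block is selected and changed; every other entry of `weight` stays frozen, which is where the savings in trainable parameters and optimizer state would come from.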

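By selecting sparse sub-matrices to reduce computational overhead and memory consumption, we introduce a method called Sparse Matrix Tuning (SMT) that narrows the performance gap between parameter-efficient fine-tuning (PEFT) and full fine-tuning (FT). Across multiple tasks it outperforms other PEFT baselines (such as LoRA and DoRA) while reducing the GPU memory footprint by 67% compared to FT.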

Sparse Matrix in Large Language Model Fine-tuning

In value-based deep reinforcement learning with replay memories, the batch
size parameter specifies how many transitions to sample for each gradient
update. Although critical to the learning process, this value is typically not
adjusted when proposing new algorithms. In this work we present a broad
empirical study that suggests reducing the batch size can result in a
number of significant performance gains; this is surprising, as the general
tendency when training neural networks is towards larger batch sizes for
improved performance. We complement our experimental findings with a set of
empirical analyses towards better understanding this phenomenon.
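
To make the role of the batch size concrete, here is a minimal replay-memory sketch. The class name, the capacity, the transition layout, and the batch sizes 8 and 32 are illustrative assumptions and are not taken from the study.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity buffer of transitions; the oldest entries are evicted first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement; batch_size is the number of
        # transitions that feed a single gradient update.
        return rng.sample(list(self.buffer), batch_size)


memory = ReplayMemory(capacity=1000)
for t in range(100):
    memory.push((t, 0, 0.0, t + 1))  # (state, action, reward, next_state)

rng = random.Random(0)
small_batch = memory.sample(8, rng)   # a reduced batch size
large_batch = memory.sample(32, rng)  # a larger, more conventional choice
```

The only thing that changes between the two calls is how many stored transitions contribute to one update, which is the quantity the study varies.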

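In value-based deep reinforcement learning with replay memories, the batch size parameter specifies how many transitions to sample for each gradient update. Although this value is critical to the learning process, it is typically not adjusted when new algorithms are proposed. In this work, we carry out a broad empirical study suggesting that reducing the batch size can yield a number of significant performance gains; this is surprising, since the general tendency when training neural networks is to use larger batch sizes for improved performance. We complement our experimental results with a set of empirical analyses to better understand this phenomenon.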

Small batch deep reinforcement learning

An important class of problems involves training deep neural networks with
sparse prediction targets of very high dimension D. These occur naturally in
e.g. neural language models or the learning of word-embeddings, often posed as
predicting the probability of next words among a vocabulary of size D (e.g.
200,000). Computing the equally large, but typically non-sparse D-dimensional
output vector from a last hidden layer of reasonable dimension d (e.g. 500)
incurs a prohibitive $O(Dd)$ computational cost for each example, as does
updating the $D \times d$ output weight matrix and computing the gradient
needed for backpropagation to previous layers. While efficient handling of
large sparse network inputs is trivial, the case of large sparse targets is
not, and has thus so far been sidestepped with approximate alternatives such as
hierarchical softmax or sampling-based approximations during training. In this
work we develop an original algorithmic approach which, for a family of loss
functions that includes squared error and spherical softmax, can compute the
exact loss, gradient update for the output weights, and gradient for
backpropagation, all in $O(d^{2})$ per example instead of $O(Dd)$, remarkably
without ever computing the D-dimensional output. The proposed algorithm yields
a speedup of up to $D/(4d)$, i.e., two orders of magnitude for typical sizes, for
that critical part of the computations that often dominates the training time
in this kind of network architecture.
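
For the squared-error case, the core identity behind this claim can be checked numerically. The sketch below is a partial illustration and not the paper's full algorithm: it assumes the $d \times d$ matrix $Q = W^T W$ is already available, and it covers only the loss and the gradient with respect to the hidden vector $h$. Keeping $Q$ up to date while the output weights change, which the abstract says is also done in $O(d^{2})$, is not shown. All variable names and sizes are hypothetical.

```python
import random


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


rng = random.Random(0)
D, d = 50, 4                                  # toy sizes; real D would be far larger
W = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(D)]  # D x d output weights
h = [rng.gauss(0, 1) for _ in range(d)]       # last hidden layer
y_sparse = {3: 1.0, 17: 0.5}                  # sparse target: index -> value

# Assumed to be precomputed and maintained elsewhere: Q = W^T W (d x d).
Q = [[sum(W[k][i] * W[k][j] for k in range(D)) for j in range(d)]
     for i in range(d)]

# Naive route, O(Dd): build the D-dimensional output explicitly.
o = matvec(W, h)
y_dense = [y_sparse.get(k, 0.0) for k in range(D)]
resid = [a - b for a, b in zip(o, y_dense)]
naive_loss = sum(r * r for r in resid)
naive_grad = [2 * sum(W[k][i] * resid[k] for k in range(D)) for i in range(d)]

# Fast route, O(d^2 + Kd) for K non-zero targets: never forms the output.
# ||Wh - y||^2 = h^T Q h - 2 h^T (W^T y) + y^T y
Qh = matvec(Q, h)
Wty = [sum(W[k][i] * v for k, v in y_sparse.items()) for i in range(d)]
fast_loss = dot(h, Qh) - 2 * dot(h, Wty) + sum(v * v for v in y_sparse.values())
fast_grad = [2 * (a - b) for a, b in zip(Qh, Wty)]  # 2 (Q h - W^T y)
```

Both routes agree up to floating-point error, but the fast one touches only $Q$, $h$, and the few rows of $W$ indexed by the non-zero targets.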

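This paper proposes an algorithmic approach that computes the loss and the gradient update for the output weights directly, without ever computing the high-dimensional output vector, enabling efficient training of deep neural networks.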

Exact gradient updates in time independent of output size for the spherical loss family

An important class of problems involves training deep neural networks with
sparse prediction targets of very high dimension D. These occur naturally in
e.g. neural language models or the learning of word-embeddings, often posed as
predicting the probability of next words among a vocabulary of size D (e.g.
200,000). Computing the equally large, but typically non-sparse D-dimensional
output vector from a last hidden layer of reasonable dimension d (e.g. 500)
incurs a prohibitive O(Dd) computational cost for each example, as does
updating the D x d output weight matrix and computing the gradient needed for
backpropagation to previous layers. While efficient handling of large sparse
network inputs is trivial, the case of large sparse targets is not, and has
thus so far been sidestepped with approximate alternatives such as hierarchical
softmax or sampling-based approximations during training. In this work we
develop an original algorithmic approach which, for a family of loss functions
that includes squared error and spherical softmax, can compute the exact loss,
gradient update for the output weights, and gradient for backpropagation, all
in O(d^2) per example instead of O(Dd), remarkably without ever computing the
D-dimensional output. The proposed algorithm yields a speedup of D/(4d), i.e.,
two orders of magnitude for typical sizes, for that critical part of the
computations that often dominates the training time in this kind of network
architecture.
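
As a quick check of the "two orders of magnitude" figure, plugging in the example sizes quoted above ($D = 200{,}000$, $d = 500$) and reading the speedup as $D$ divided by $4d$ gives:

```latex
\[
\frac{D}{4d} = \frac{200{,}000}{4 \times 500} = \frac{200{,}000}{2{,}000} = 100
\]
```

That is a factor of $10^{2}$ for the output-layer computations, consistent with the stated claim.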

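This paper proposes an algorithm for training deep neural networks with large, high-dimensional sparse targets. It greatly improves computational efficiency by reducing the computation time needed for weight updates and backpropagation.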