A primary function of back-propagation is to compute the gradients of both
hidden representations and parameters for optimization with gradient descent.
Training large models incurs high computational costs because of their vast
parameter counts. Although Parameter-Efficient Fine-Tuning (PEFT) methods train
smaller auxiliary models to save computational space, they still incur
computational overhead, especially in Fine-Tuning as a Service (FTaaS) for
numerous users. We introduce Collaborative Adaptation (ColA) with Gradient
Learning (GL), a parameter-free, model-agnostic fine-tuning approach that
decouples the computation of the gradients of hidden representations from that
of the parameters. Compared with PEFT methods, ColA enables more cost-effective
FTaaS by offloading the gradient computation to low-cost devices. We also
provide a theoretical analysis of ColA and demonstrate experimentally that
ColA performs on par with or better than existing PEFT methods on various
benchmarks.
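
The decoupling described above can be illustrated with a toy example. The sketch below is not ColA's implementation; it only shows the chain-rule fact that makes offloading possible: once the gradient of the loss with respect to a hidden representation is known, the parameter gradient can be computed elsewhere from (input, hidden-gradient) pairs alone. The one-output linear model, the squared-error loss, and the names `server_step` and `worker_param_grad` are assumptions made for illustration, as is the reading that the parameter gradient is the part being offloaded.

```python
# Illustrative sketch only: a toy linear model, not the ColA code base.

def server_step(w, batch):
    """'Server' side (hypothetical): forward pass plus dL/dh, no parameter gradient."""
    records = []
    for x, y in batch:
        h = sum(wi * xi for wi, xi in zip(w, x))  # hidden representation h = w . x
        grad_h = h - y                            # dL/dh for L = 0.5 * (h - y)^2
        records.append((x, grad_h))               # everything the worker needs
    return records


def worker_param_grad(records):
    """'Low-cost device' side (hypothetical): dL/dw from (x, dL/dh) pairs alone."""
    dim = len(records[0][0])
    grad_w = [0.0] * dim
    for x, grad_h in records:
        for i in range(dim):
            grad_w[i] += grad_h * x[i]            # chain rule: dL/dw_i = dL/dh * x_i
    return grad_w


def loss(w, batch):
    """Summed squared-error loss, used only to check the gradient numerically."""
    return sum(
        0.5 * (sum(wi * xi for wi, xi in zip(w, x)) - y) ** 2 for x, y in batch
    )


def numerical_grad(w, batch, eps=1e-6):
    """Central finite differences, as an independent reference for dL/dw."""
    grads = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += eps
        down[i] -= eps
        grads.append((loss(up, batch) - loss(down, batch)) / (2 * eps))
    return grads
```

In this toy setting, `worker_param_grad(server_step(w, batch))` agrees with `numerical_grad(w, batch)`, even though the worker never sees the loss or the labels.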

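Collaborative Adaptation (ColA) with Gradient Learning (GL), a parameter-free, model-agnostic fine-tuning method, performs on par with or better than existing parameter-efficient fine-tuning methods on various benchmarks. ColA is also more cost-effective computationally, making Fine-Tuning as a Service practical by offloading gradient computation to low-cost devices.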

ColA: Collaborative Adaptation with Gradient Learning

This paper proposes GProp, a deep reinforcement learning algorithm for
continuous policies with compatible function approximation. The algorithm is
based on two innovations. First, we present a temporal-difference based
method for learning the gradient of the value function. Second, we present
the deviator-actor-critic (DAC) model, which comprises three neural networks
that respectively estimate the value function, estimate its gradient, and
determine the actor's policy. We evaluate GProp on two challenging tasks: a
contextual bandit problem, constructed from nonparametric regression datasets
and designed to probe the ability of reinforcement learning algorithms to
estimate gradients accurately; and the octopus arm, a demanding reinforcement
learning benchmark. GProp is competitive with fully supervised methods on the
bandit task and achieves the best performance to date on the octopus arm.
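
To make the three roles in the deviator-actor-critic model concrete, here is a heavily simplified sketch. It is not the GProp algorithm from the paper: each neural network is replaced by a single scalar, the task is a context-free bandit with a quadratic reward, and the update rule (fit the reward of a perturbed action with a value estimate plus a gradient estimate times the perturbation) is one plausible reading of "learning the gradient of the value function" from a temporal-difference style error. The function name, reward, and step sizes are all assumptions.

```python
import random


def train_dac_bandit(target=1.5, steps=5000, sigma=0.3, seed=0):
    """Toy deviator-actor-critic loop on a one-dimensional bandit (illustrative only)."""
    rng = random.Random(seed)
    mu = 0.0  # actor: the deterministic action it currently proposes
    q = 0.0   # critic: estimate of the expected reward around mu
    g = 0.0   # deviator: estimate of d(reward)/d(action) at mu
    lr_critic, lr_deviator, lr_actor = 0.1, 0.5, 0.01  # assumed step sizes

    for _ in range(steps):
        eps = rng.gauss(0.0, sigma)          # exploration noise added to the action
        reward = -(mu + eps - target) ** 2   # assumed reward, maximal at the target

        # Error of the first-order model: reward ~ q + g * eps.
        delta = reward - q - g * eps

        q += lr_critic * delta               # critic absorbs the constant part
        g += lr_deviator * delta * eps       # deviator absorbs the part linear in eps
        mu += lr_actor * g                   # actor ascends the estimated gradient

    return mu, q, g
```

A possible use is `mu, q, g = train_dac_bandit()`, after which `mu` should have moved from 0 toward the assumed target of 1.5. Note that the actor never differentiates the reward; it only follows the deviator's estimate.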

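This work proposes GProp, a new deep reinforcement learning algorithm for training continuous-action policies. The algorithm is based on a temporal-difference method for learning the gradient of the value function, and it introduces the deviator-actor-critic (DAC) model, which comprises three neural networks that respectively estimate the value function, estimate its gradient, and determine the actor's policy. GProp is evaluated on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets, and the octopus arm. On the former, GProp performs on par with fully supervised methods; on the latter, it achieves the best performance to date.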