Language models (LMs) exhibit impressive performance and generalization
capabilities. However, LMs struggle with the persistent challenge of
catastrophic forgetting, which undermines their long-term sustainability in
continual learning (CL). Existing approaches usually address the issue by
incorporating old task data or task-wise inductive bias into LMs. However, old
data and accurate task information are often unavailable or costly to collect,
hindering the availability of current CL approaches for LMs. To address this
limitation, we introduce $\textbf{MIGU}$ ($\textbf{M}$agn$\textbf{I}$tude-based
$\textbf{G}$radient $\textbf{U}$pdating for continual learning), a
rehearsal-free and task-label-free method that only updates the model
parameters with large magnitudes of output in LMs' linear layers. MIGU is based
on our observation that the L1-normalized magnitude distribution of the output
in LMs' linear layers is different when the LM models deal with different task
data. By imposing this simple constraint on the gradient update process, we can
leverage the inherent behaviors of LMs, thereby unlocking their innate CL
abilities. Our experiments demonstrate that MIGU is universally applicable to
all three LM architectures (T5, RoBERTa, and Llama2), delivering
state-of-the-art or on-par performance across continual finetuning and
continual pre-training settings on four CL benchmarks. For example, MIGU brings
a 15.2% average accuracy improvement over conventional parameter-efficient
finetuning baselines in a 15-task CL benchmark. MIGU can also seamlessly
integrate with all three existing CL types to further enhance performance. Code
is available at \href{this https URL}{this https URL}.

通过 L1 归一化的输出幅度分布来约束梯度更新过程，我们提出了一种无需回放和任务标签的方法 MIGU（基于幅度的渐进学习梯度更新），以释放语言模型的内在连续学习能力。实验证明 MIGU 对于所有三种语言模型架构（T5，RoBERTa 和 Llama2）普遍适用，在四个连续学习基准测试中，在连续微调和连续预训练设置下，提供了最先进或不相上下的性能。