Domain incremental learning (DIL) poses a significant challenge in real-world
scenarios, as models need to be sequentially trained on diverse domains over
time while avoiding catastrophic forgetting. Mitigating representation
drift, which refers to the phenomenon of learned representations undergoing
changes as the model adapts to new tasks, can help alleviate catastrophic
forgetting. In this study, we propose a novel DIL method named DARE, featuring
a three-stage training process: Divergence, Adaptation, and REfinement. This
process gradually adapts the representations associated with new tasks into the
feature space spanned by samples from previous tasks, simultaneously
integrating task-specific decision boundaries. Additionally, we introduce a
novel strategy for buffer sampling and demonstrate the effectiveness of our
proposed method, combined with this sampling strategy, in reducing
representation drift within the feature encoder. This reduction alleviates
catastrophic forgetting across multiple DIL benchmarks. Furthermore,
our approach prevents sudden representation drift at task boundaries, resulting
in a well-calibrated DIL model that maintains performance on previous
tasks.
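
To make the notion of representation drift concrete, here is a minimal sketch of one way it could be quantified. It assumes drift is measured as the mean L2 distance between the encoder features of the same buffered samples before and after training on a new task. The function name `representation_drift` and this particular metric are illustrative assumptions, not the measure reported by the authors.

```python
import numpy as np

def representation_drift(features_before, features_after):
    """Mean L2 distance between per-sample features before and after an update.

    Both inputs have shape (num_samples, feature_dim) and hold features of the
    SAME samples, extracted by the encoder at two points in training.
    This metric is an illustrative assumption, not the paper's definition.
    """
    diff = np.asarray(features_after, dtype=float) - np.asarray(features_before, dtype=float)
    return float(np.linalg.norm(diff, axis=1).mean())
```

A value near zero on buffered samples from earlier tasks would indicate that the encoder's representations of those tasks stayed stable while the model learned the new one.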

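This study proposes a novel DIL method named DARE. Through a three-stage training process of divergence, adaptation, and refinement, it gradually adapts the representations associated with new tasks into the feature space spanned by samples from previous tasks while integrating task-specific decision boundaries. This reduces representation drift in the feature encoder, lowers catastrophic forgetting on multiple DIL benchmarks, and prevents sudden representation drift at task boundaries, yielding a well-calibrated DIL model that maintains performance on previous tasks.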

Gradual Divergence for Seamless Adaptation: A Novel Domain Incremental Learning Method

In this paper, we uncover that Language Models (LMs), either encoder- or
decoder-based, can obtain new capabilities by assimilating the parameters of
homologous models without retraining or GPUs. Typically, new abilities of LMs
can be imparted by Supervised Fine-Tuning (SFT), reflected in the disparity
between fine-tuned and pre-trained parameters (i.e., delta parameters). We
first introduce a novel operation called DARE (Drop And REscale) and observe
that most delta parameters can be set directly to zero without affecting
the capabilities of SFT LMs, and that larger models can tolerate a higher
proportion of discarded parameters. Based on this observation, we further
sparsify delta
parameters of multiple SFT homologous models with DARE and subsequently merge
them into a single model by parameter averaging. We conduct experiments on
eight datasets from the GLUE benchmark with BERT and RoBERTa. We also merge
WizardLM, WizardMath, and Code Alpaca based on Llama 2. Experimental results
show that: (1) The delta parameter value ranges for SFT models are typically
small, often within 0.005, and DARE can eliminate 99% of them effortlessly.
However, once the models are continuously pre-trained, the value ranges can
grow to around 0.03, making DARE impractical. We also tried removing
fine-tuned parameters instead of delta parameters and found that removing 10%
can drastically decrease performance (even to 0). This highlights that SFT
merely stimulates the abilities via delta parameters rather than injecting new
abilities into LMs; (2) DARE can merge multiple task-specific LMs into one LM
with diverse abilities. For instance, the merger of WizardLM and WizardMath
improves the GSM8K zero-shot accuracy of WizardLM from 2.2 to 66.3, retaining
its instruction-following ability while surpassing WizardMath's original score
of 64.2. Code is available at this https URL

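We show that language models (LMs) can acquire new capabilities by absorbing the parameters of homologous models, without retraining or GPUs. We introduce a new operation named DARE (Drop And Rescale) that sets the vast majority of delta parameters directly to zero and can merge multiple task-specific LMs into one LM with diverse abilities.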
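
The drop-and-rescale operation and the averaging merge described in the abstract can be sketched as follows. This is a minimal illustration on flat NumPy arrays, not the authors' released code. It assumes that each delta parameter is dropped independently at random with probability `drop_rate` and that the survivors are rescaled by `1 / (1 - drop_rate)`, so the expected delta is unchanged. The function names `dare` and `merge` and the argument names are hypothetical.

```python
import numpy as np

def dare(pretrained, finetuned, drop_rate, rng):
    """Drop a random fraction of delta parameters and rescale the rest.

    Assumption: independent random dropping with probability `drop_rate`
    (which must be below 1), then rescaling by 1 / (1 - drop_rate).
    """
    delta = finetuned - pretrained
    # Each entry is kept with probability 1 - drop_rate.
    keep = rng.random(delta.shape) >= drop_rate
    return delta * keep / (1.0 - drop_rate)

def merge(pretrained, finetuned_models, drop_rate, seed=0):
    """Sparsify each model's delta with DARE, then average the deltas
    and add the result back onto the shared pre-trained weights."""
    rng = np.random.default_rng(seed)
    deltas = [dare(pretrained, ft, drop_rate, rng) for ft in finetuned_models]
    return pretrained + np.mean(deltas, axis=0)
```

With `drop_rate=0.9`, roughly 90% of each delta is zeroed and the remaining entries are multiplied by 10. The fine-tuned models must share the same pre-trained starting point (homologous models) for the deltas to be comparable.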