Catastrophic forgetting (CF) is a phenomenon that occurs in machine learning
when a model forgets previously learned information as it learns new
information. As large language models (LLMs) have shown excellent performance,
it is interesting to uncover whether CF exists in the continual fine-tuning of
LLMs. In this study, we empirically evaluate the forgetting phenomenon in LLMs'
knowledge, from the perspectives of domain knowledge, reasoning, and reading
comprehension. The experiments demonstrate that catastrophic forgetting is
generally observed in LLMs ranging from 1B to 7B parameters. Furthermore, as
the model scale increases, the severity of forgetting also intensifies. The
decoder-only model BLOOMZ suffers less forgetting and maintains more knowledge
than the encoder-decoder model mT0. We also observe that LLMs can
mitigate language bias (e.g., gender bias) during continual fine-tuning.
Moreover, we find that ALPACA maintains more knowledge and capacity than
LLAMA during continual fine-tuning, which implies that general
instruction tuning can help mitigate the forgetting phenomenon of LLMs in
subsequent fine-tuning.

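Large language models exhibit catastrophic forgetting during continual fine-tuning, and the forgetting becomes more severe as model scale increases. A comparison of the decoder-only model BLOOMZ with the encoder-decoder model mT0 shows that BLOOMZ forgets less and retains more knowledge. LLMs can also mitigate language bias during continual fine-tuning, and general instruction tuning helps reduce forgetting in subsequent fine-tuning.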

An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning

Tangent Model Composition (TMC) is a method to combine component models
independently fine-tuned around a pre-trained point. Component models are
tangent vectors to the pre-trained model that can be added, scaled, or
subtracted to support incremental learning, ensembling, or unlearning.
Component models are composed at inference time via scalar combination,
reducing the cost of ensembling to that of a single model. TMC improves
accuracy by 4.2% compared to ensembling non-linearly fine-tuned models at a
2.5x to 10x reduction of inference cost, growing linearly with the number of
component models. Each component model can be forgotten at zero cost, with no
residual effect on the resulting inference. When used for continual
fine-tuning, TMC is not constrained by sequential bias and can be executed in
parallel on federated data. TMC outperforms recently published continual
fine-tuning methods almost uniformly on each setting -- task-incremental,
class-incremental, and data-incremental -- on a total of 13 experiments across
3 benchmark datasets, despite not using any replay buffer. TMC is designed for
composing models that are local to a pre-trained embedding, but could be
extended to more general settings.
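
The claim that scalar combination reduces ensembling to the cost of a single model follows from linearity: a tangent model is first-order in its weight offset, so averaging the outputs of several tangent models equals one evaluation at the averaged offset. The sketch below illustrates this on a toy scalar model. It is not the authors' implementation; the model `f`, its hand-derived Jacobian-vector product `jvp`, and all names and values are hypothetical stand-ins chosen for illustration.

```python
import math


def f(w, x):
    """Toy scalar model (assumed stand-in for a network): tanh(w . x)."""
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x)))


def jvp(w0, x, v):
    """Directional derivative of f at w0 along v, derived by hand for this toy model."""
    pre = sum(wi * xi for wi, xi in zip(w0, x))
    return (1.0 - math.tanh(pre) ** 2) * sum(vi * xi for vi, xi in zip(v, x))


def tangent_model(w0, x, v):
    """First-order expansion around the pre-trained point w0: f(w0) + J(x) v."""
    return f(w0, x) + jvp(w0, x, v)


def compose(components, weights):
    """Scalar combination of tangent vectors in weight space."""
    dim = len(components[0])
    return [sum(a * v[i] for a, v in zip(weights, components)) for i in range(dim)]


# Hypothetical pre-trained weights and three independently fine-tuned offsets.
w0 = [0.5, -0.2, 0.1]
v1 = [0.1, 0.0, -0.3]
v2 = [-0.2, 0.4, 0.1]
v3 = [0.05, 0.05, 0.05]
x = [1.0, 2.0, -1.0]

# Ensembling: three forward passes, then average the outputs.
ensemble_avg = sum(tangent_model(w0, x, v) for v in (v1, v2, v3)) / 3

# Composition: average the offsets first, then a single forward pass.
composed = compose([v1, v2, v3], [1 / 3] * 3)
single_pass = tangent_model(w0, x, composed)

# Unlearning a component: recompose without v3; no retraining is involved.
without_v3 = tangent_model(w0, x, compose([v1, v2], [0.5, 0.5]))
ensemble_without_v3 = (tangent_model(w0, x, v1) + tangent_model(w0, x, v2)) / 2
```

Here `ensemble_avg` and `single_pass` agree up to floating-point error because the combination weights sum to one. The same identity does not hold for non-linearly fine-tuned models, where the output of averaged weights generally differs from the average of outputs.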

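Tangent Model Composition (TMC) combines component models that are independently fine-tuned around a pre-trained point, supporting incremental learning, ensembling, and unlearning. Component models are composed at inference time via scalar combination, which reduces the cost of ensembling to that of a single model. Compared with ensembling non-linearly fine-tuned models, TMC improves accuracy by 4.2% at a 2.5x to 10x lower inference cost. For continual fine-tuning, it can run in parallel and needs no replay buffer; it is evaluated in 13 experiments across 3 benchmark datasets.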

Tangent Model Composition for Ensembling and Continual Fine-tuning

Current LLMs have demonstrated remarkable capabilities in addressing users'
requests for various types of information. However, these models are limited by
the most recent data available in their pretraining corpora, rendering them
incapable of providing up-to-date information. Retraining LLMs from scratch is
cost-prohibitive, and the effectiveness of continual fine-tuning on new corpora
has not been thoroughly examined. Additionally, current update procedures
typically demand significant human input to convert the information into a more
structured format, such as knowledge triples, conversational data, or responses
with human feedback. In this study, we conduct a comprehensive examination of a
novel self information update task in LLMs, which only requires the provision
of informative text corpora. For instance, we can use the latest news articles
to update the LLMs' existing knowledge. We define the self information update
task and assess the continual fine-tuning approach for this purpose. We observe
that the naive method of continual fine-tuning can be problematic due to LLMs'
exposure bias, which prioritizes existing information over new information we
aim to integrate and leads to incorrect reasoning chains that ultimately
diminish the efficacy of information updates. Based on our analysis, we propose
an effective method to mitigate exposure bias by incorporating the selection of
relevant facts into training losses. Furthermore, we develop a dataset to
evaluate information updates, derived from news articles published after March
2023. Experimental results demonstrate that our proposed approach significantly
increases the factual consistency score (on a scale of 0 to 1) by 0.16 while having minimal
impact on performance for instructions not directly related to the new
information.

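This paper presents a comprehensive study of the self information update task for LLMs and evaluates continual fine-tuning as an approach to it. The authors find that naive continual fine-tuning can suffer from exposure bias, and they propose an effective method to mitigate the problem. They also build a dataset from news articles to evaluate information updates. Experimental results show that the proposed method raises the factual consistency score (on a scale of 0 to 1) by 0.16, with almost no impact on performance for instructions not directly related to the new information.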