In the domain of large language models (LLMs), arXiv:2305.16938 showed that
few-shot full-model fine-tuning -- namely Vanilla Fine-Tuning (FT) and
Pattern-Based Fine-Tuning (PBFT) -- and In-Context Learning (ICL) generalize
similarly on Out-Of-Domain (OOD) datasets, but vary in terms of task
adaptation. However, both approaches pose challenges, especially in terms of
memory requirements. In this paper, we try to further the understanding of
different fine-tuning strategies for LLMs and aim to place a range of them on
an equal footing for a detailed comparison with full-model fine-tuning on
two diverse datasets. To that end, we conducted a series of experiments,
beginning with state-of-the-art methods like vanilla fine-tuning and
Pattern-Based Fine-Tuning (PBFT) on pre-trained models across two datasets,
CoLA and MNLI. We then investigate adaptive fine-tuning and the efficiency of
LoRA adapters in a few-shot setting. Finally, we also compare an alternative
approach that has gained recent popularity -- context distillation -- with the
vanilla FT and PBFT, with and without a few-shot setup.
Our findings suggest that the alternative strategies we explored can
exhibit out-of-domain generalization comparable to that of vanilla FT and PBFT.
PBFT under-performs vanilla FT on out-of-domain (OOD) data, emphasizing the
need for effective prompts. Further, our adaptive fine-tuning and LoRA
experiments perform comparably to or slightly worse than standard fine-tuning,
as anticipated, since standard fine-tuning involves tuning the entire model.
Finally, our context distillation experiments out-perform the standard
fine-tuning methods. These findings underscore that the choice of an
appropriate fine-tuning method ultimately depends on the available resources
(memory, compute, data) and task adaptability.
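
The memory trade-off between tuning the entire model and tuning LoRA adapters can be illustrated with a rough parameter count. The sketch below assumes the standard LoRA formulation, in which a frozen weight matrix of shape (d_out, d_in) is augmented by two trainable low-rank factors of shapes (d_out, r) and (r, d_in). The layer size of 768 and the rank of 8 are illustrative assumptions, not values reported in the abstract.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for LoRA factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


# Hypothetical layer size and rank, chosen only for illustration.
d_out, d_in, rank = 768, 768, 8
full = full_finetune_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)
print(f"full: {full}, LoRA: {lora}, ratio: {lora / full:.4f}")
```

Under these assumed sizes the adapter trains about one forty-eighth of the parameters in that layer, which is consistent with the expectation that it may perform slightly worse than tuning the entire model while needing far less memory for gradients and optimizer state.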

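This study examines fine-tuning strategies for large language models. It finds that the alternative methods match standard methods in out-of-domain generalization, highlights the need for effective prompts, and concludes that the choice of fine-tuning method should depend on the available resources and on task adaptability.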

Comparative Analysis of Different Efficient Fine Tuning Methods of Large Language Models (LLMs) in Low-Resource Setting

Modern language models have the capacity to store and use immense amounts of
knowledge about real-world entities, but it remains unclear how to update their
implicit "knowledge bases." While prior methods for updating knowledge in LMs
successfully inject facts, updated LMs then fail to make inferences based on
these injected facts. In this work, we demonstrate that a context
distillation-based approach can both impart knowledge about entities and
propagate that knowledge to enable broader inferences. Our approach consists of
two stages: transfer set generation and distillation on the transfer set. We
first generate a transfer set by simply prompting a language model to generate
a continuation from the entity definition. Then, we update the model parameters
so that the distribution of the LM (the student) matches the distribution of
the LM conditioned on the definition (the teacher) on the transfer set. Our
experiments demonstrate that this approach is more effective at propagating
knowledge updates than fine-tuning and other gradient-based
knowledge-editing methods, without compromising performance in other contexts,
even when injecting the definitions of up to 150 entities at once.
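
The distillation stage described above can be sketched as a per-token divergence between the teacher's and the student's next-token distributions on a transfer-set continuation. This is a minimal sketch, assuming the mismatch is measured with KL(teacher || student); the abstract does not state the exact objective, and the toy logits and function names below are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def distillation_loss(
    teacher_logits: Sequence[Sequence[float]],
    student_logits: Sequence[Sequence[float]],
) -> float:
    """Mean per-token KL(teacher || student) over one continuation.

    teacher_logits: LM outputs with the entity definition in context.
    student_logits: outputs of the same LM without the definition.
    """
    per_token = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_token) / len(per_token)


# Toy logits over a 3-token vocabulary for a 2-token continuation (assumed values).
teacher = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]
student = [[0.5, 0.5, 0.5], [0.2, 1.5, 0.3]]
print(distillation_loss(teacher, student))
```

In an actual training loop the student logits would come from the model being updated and the loss would be minimized by gradient descent; the pure-Python version only shows what quantity is being driven toward zero.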

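This paper introduces a method for updating the knowledge stored in language models: context distillation imparts knowledge about entities and propagates those updates effectively, without degrading performance in other contexts.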

Propagating Knowledge Updates to LMs Through Distillation

Language models significantly benefit from context tokens, such as prompts or
scratchpads. They perform better when prompted with informative instructions,
and they acquire new reasoning capabilities by generating a scratchpad before
predicting the final answers. However, they do not \textit{internalize} these
performance gains, which disappear when the context tokens are gone. Our work
proposes to apply context distillation so that a language model can improve
itself by internalizing these gains. Concretely, given a synthetic unlabeled
input for the target task, we condition the model on ``[instructions] +
[task-input]'' to predict ``[scratch-pad] + [final answer]''; then we fine-tune
the same model to predict its own ``[final answer]'' conditioned on the
``[task-input]'', without seeing the ``[instructions]'' or using the
``[scratch-pad]''.
We show that context distillation is a general method to train language
models, and it can effectively internalize 3 types of training signals. First,
it can internalize abstract task instructions and explanations, so we can
iteratively update the model parameters with new instructions and overwrite old
ones. Second, it can internalize step-by-step reasoning for complex tasks
(e.g., 8-digit addition), and such a newly acquired capability proves to be
useful for other downstream tasks. Finally, it can internalize concrete
training examples, and it outperforms directly learning with gradient descent
by 9\% on the SPIDER Text-to-SQL dataset; furthermore, combining context
distillation operations can internalize more training examples than the context
window size allows.
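
The data-construction step of this procedure can be sketched as follows: the model is run with the instructions in context, the scratch-pad is discarded, and only the pair (task input, final answer) is kept for fine-tuning. This is a minimal sketch under stated assumptions: `generate` is a hypothetical stand-in for the language model, the `Answer:` marker is an assumed output convention, and `toy_model` is a fake model used only to make the example runnable.

```python
from typing import Callable, List, Tuple


def build_distillation_pairs(
    instructions: str,
    task_inputs: List[str],
    generate: Callable[[str], str],
    answer_marker: str = "Answer:",
) -> List[Tuple[str, str]]:
    """Return (task_input, final_answer) pairs for fine-tuning.

    The teacher pass sees the instructions and task input; the stored
    pair omits both the instructions and the scratch-pad.
    """
    pairs = []
    for task_input in task_inputs:
        teacher_output = generate(f"{instructions}\n{task_input}")
        # Keep only the text after the last answer marker; drop the scratch-pad.
        _, sep, answer = teacher_output.rpartition(answer_marker)
        if not sep:
            continue  # No final answer found; skip this input.
        pairs.append((task_input, answer.strip()))
    return pairs


def toy_model(prompt: str) -> str:
    """Fake LM: adds two numbers found on the last line of the prompt."""
    a, b = prompt.splitlines()[-1].split("+")
    total = int(a) + int(b)
    return f"Scratch-pad: add {a.strip()} and {b.strip()} digit by digit.\nAnswer: {total}"


pairs = build_distillation_pairs(
    "Add the two numbers step by step.", ["12 + 30", "7 + 8"], toy_model
)
print(pairs)
```

Fine-tuning the same model on these pairs is what lets it reproduce the final answer from the task input alone, without the instructions or the scratch-pad at inference time.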

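This paper proposes context distillation to internalize the performance gains that language models obtain from context prompts or scratchpads. The method can internalize abstract task instructions, step-by-step reasoning, and concrete training examples, and thereby trains language models effectively.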