Training LLMs is expensive, and recent evidence indicates training all the
way to convergence is inefficient. In this paper, we investigate the ability of
a simple idea, checkpoint averaging along the trajectory of a training run to
improve the quality of models before they have converged. This approach incurs
no extra cost during training or inference. Specifically, we analyze the
training trajectories of Pythia LLMs with 1 to 12 billion parameters and
demonstrate that, particularly during the early to mid stages of training, this
idea accelerates convergence and improves both test and zero-shot
generalization. Loss spikes are a well recognized problem in LLM training; in
our analysis we encountered two instances of this in the underlying
trajectories, and both instances were mitigated by our averaging.
For a 6.9B parameter LLM, for example, our early weight averaging recipe can
save upto 4200 hours of GPU time, which corresponds to significant savings in
cloud compute costs.

通过运用检查点平均化方法来改进大型语言模型（LLMs）的质量，在不增加额外培训或推理成本的前提下，缩短训练时间并提高测试和零样本泛化能力。