Large language models (LLMs) are routinely pre-trained on billions of tokens,
only to restart the process over again once new data becomes available. A much
cheaper and more efficient solution would be to enable the continual
pre-training of these models, i.e. updating pre-trained models with new data
instead of re-training them from scratch. However, the distribution shift
induced by novel data typically results in degraded performance on past data.
Taking a step towards efficient continual pre-training, in this work, we
examine the effect of different warm-up strategies. Our hypothesis is that the
learning rate must be re-increased to improve compute efficiency when training
on a new dataset. We study the warmup phase of models pre-trained on the Pile
(upstream data, 300B tokens) as we continue to pre-train on SlimPajama
(downstream data, 297B tokens), following a linear warmup and cosine decay
schedule. We conduct all experiments on the Pythia 410M language model
architecture and evaluate performance through validation perplexity. We
experiment with different pre-training checkpoints, various maximum learning
rates, and various warmup lengths. Our results show that while rewarming models
first increases the loss on upstream and downstream data, in the longer run it
improves the downstream performance, outperforming models trained from
scratch$\unicode{x2013}$even for a large downstream dataset.

这项研究考察了不同预热策略对大型语言模型的影响，发现重启模型预热可以提高下游性能，即使在大型下游数据集中也优于从头开始训练的模型。