In recent years, Large Language Models (LLMs) have made significant strides towards Artificial General Intelligence. However, training these models from scratch requires substantial computational resources and vast amounts of text data. In this paper, we explore an alternative approach to constructing an LLM for a new language: continual pretraining (CPT) from an existing pretrained LLM, rather than starting from randomly initialized parameters. Based on parallel experiments on 40 model sizes ranging from 40M to 5B parameters, we find that 1) CPT converges faster and saves significant resources in a scalable manner; 2) CPT adheres to an extended scaling law derived from Hoffmann et al. (2022) with a joint data-parameter scaling term; 3) the compute-optimal data-parameter allocation for CPT markedly differs based on our estimated scaling factors; 4) the effectiveness of transfer at scale is influenced by training duration and linguistic properties, while remaining robust to data replaying, a method that effectively mitigates catastrophic forgetting in CPT. We hope our findings provide deeper insights into the transferability of LLMs at scale for the research community.
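
To make findings 2) and 3) concrete, the following is a sketch of how a joint data-parameter term can shift the compute-optimal allocation. The first line is the parametric loss of Hoffmann et al. (2022), with N the parameter count and D the number of training tokens. The second line is only an illustrative assumption about where a joint term might enter: the abstract does not state the exact functional form, and the exponent \gamma is a hypothetical symbol introduced here. The approximation C \approx 6ND for training compute is the usual rule of thumb and is likewise an assumption.

```latex
% Hoffmann et al. (2022): loss as a function of parameters N and tokens D
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}

% Assumed illustrative extension: the data term also depends on N
% (\gamma is hypothetical; \gamma = 0 recovers the law above)
L_{\mathrm{CPT}}(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta} N^{\gamma}}

% Minimize L_CPT under the assumed budget C \approx 6 N D, i.e. D = C / (6N):
%   f(N) = A N^{-\alpha} + B (C/6)^{-\beta} N^{\beta - \gamma}
% Setting f'(N) = 0 (an interior minimum requires \beta > \gamma):
N^{\alpha + \beta - \gamma} = \frac{\alpha A}{(\beta - \gamma) B} \left( \frac{C}{6} \right)^{\beta}

% Resulting compute-optimal exponents
N_{\mathrm{opt}} \propto C^{\beta / (\alpha + \beta - \gamma)}, \qquad
D_{\mathrm{opt}} \propto C^{(\alpha - \gamma) / (\alpha + \beta - \gamma)}
```

Under this assumed form, any \gamma > 0 raises the exponent on N and lowers the exponent on D relative to the \gamma = 0 case, so more of a fixed compute budget would go to parameters and less to tokens. This is offered only as one way to read how a joint term can change the optimal allocation; the fitted form and scaling factors are those reported in the paper itself.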

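This paper studies building large language models (LLMs) for a new language through continual pretraining (CPT). Parallel experiments across 40 model sizes show that: 1) CPT converges faster and saves substantial computational resources in a scalable manner; 2) CPT follows an extended scaling law, derived from Hoffmann et al. (2022), that includes a joint data-parameter scaling term; 3) the compute-optimal data-parameter allocation for CPT differs markedly according to the estimated scaling factors; 4) the effectiveness of transfer at scale is influenced by training duration and linguistic properties, yet is robust to data replaying, a method that effectively mitigates catastrophic forgetting in CPT. We hope these findings give the research community deeper insight into the transferability of LLMs at scale.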

A Breakthrough in Cross-Lingual Continual Pretraining at Scale