Scaling laws are useful guides for developing language models, but there are
still gaps between current scaling studies and how language models are
ultimately trained and evaluated. For instance, scaling is usually studied in
the compute-optimal training regime (i.e., "Chinchilla optimal" regime);
however, in practice, models are often over-trained to reduce inference costs.
Moreover, scaling laws mostly predict loss on next-token prediction, but
ultimately models are compared based on downstream task performance. In this
paper, we address both shortcomings. To do so, we create a testbed of 104
models with 0.011B to 6.9B parameters trained with various numbers of tokens on
three data distributions. First, we investigate scaling in the over-trained
regime. We fit scaling laws that extrapolate in both the number of model
parameters and the ratio of training tokens to parameters. This enables us to
predict the validation loss of a 1.4B parameter, 900B token run (i.e.,
32$\times$ over-trained) and a 6.9B parameter, 138B token
run$\unicode{x2014}$each from experiments that take 300$\times$ less compute.
Second, we relate the perplexity of a language model to its downstream task
performance via a power law. We use this law to predict top-1 error averaged
over downstream tasks for the two aforementioned models using experiments that
take 20$\times$ less compute. Our experiments are available at
this https URL

基于语言模型的缩放定律，本研究通过建立 104 个模型的测试平台，以不同数量的标记在三个数据分布上进行训练，研究了超过训练的情况下的缩放和语言模型的下游任务性能之间的关系。