Scaling laws in language modeling traditionally quantify training loss as a function of dataset size and model parameters, providing compute-optimal estimates but often neglecting the impact of data quality on model generalization. In this paper, we extend the conventional understanding of scaling laws by offering a microscopic view of data quality within the original formulation. We introduce a term, effective training tokens, which we posit to be a critical determinant of performance for parameter-constrained language models. Specifically, we formulate effective training tokens as a combination of two readily computed indicators of text: (i) text diversity and (ii) syntheticity as measured by a teacher model. We pretrained over $200$ models of 25M to 1.5B parameters on a diverse set of sampled and synthetic data, and estimated the constants that relate text quality, model size, training tokens, and accuracy on eight reasoning tasks. We show that accuracies predicted with the estimated constants have a Pearson correlation of +0.83 with the true accuracies, and we analyze the formulation in scenarios involving widely used data techniques, such as data sampling and synthesis, that aim to improve data quality.
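
The abstract evaluates the fitted constants by the Pearson correlation between predicted and true accuracies. As a minimal sketch of that evaluation step only, the snippet below computes a Pearson coefficient in plain Python. The function name `pearson` and the accuracy values are hypothetical and purely illustrative; they are not results from the paper, and the paper's own fitting procedure is not reproduced here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical accuracies (illustrative only, NOT values from the paper):
# one entry per pretrained model, predicted by a fitted law vs. measured.
predicted = [0.31, 0.35, 0.42, 0.47, 0.55]
observed = [0.30, 0.37, 0.40, 0.49, 0.53]

r = pearson(predicted, observed)
print(f"Pearson r = {r:.2f}")
```

A coefficient near +1 means the predicted accuracies rank and space the models much as the measured accuracies do; it does not by itself say that the predictions are unbiased.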

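This study addresses the fact that traditional language model scaling laws neglect the effect of data quality on model generalization. It proposes a new perspective, "effective training tokens," which uses text diversity and syntheticity as indicators. Over 200 models with 25M to 1.5B parameters were pretrained, and the results show a significant correlation between text quality, model size, and task accuracy. The study offers new insights and methods for improving language model performance.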

Scaling Parameter-Constrained Language Models with Quality Data