Large language models exhibit exceptional generalization capabilities, primarily attributed to the utilization of diversely sourced data. However, conventional practices in integrating this diverse data heavily rely on heuristic schemes, lacking theoretical guidance. This research tackles these limitations by investigating strategies based on low-cost proxies for data mixtures, with the aim of streamlining data curation to enhance training efficiency. Specifically, we propose a unified scaling law, termed BiMix, which accurately models the bivariate scaling behaviors of both data quantity and mixing proportions. We conduct systematic experiments and provide empirical evidence for the predictive power and fundamental principles of BiMix. Notably, our findings reveal that entropy-driven training-free data mixtures can achieve comparable or even better performance than more resource-intensive methods. We hope that our quantitative insights can shed light on further judicious research and development in cost-effective language modeling.
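As a rough illustration of what an entropy-driven, training-free data mixture could look like, the sketch below assigns each domain a mixing proportion proportional to the Shannon entropy of its token distribution. The unigram entropy proxy, the proportional normalization, the toy corpora, and the function names `shannon_entropy` and `entropy_mixture` are all assumptions made for illustration; the abstract does not specify which entropy measure or normalization is used.

```python
import math
from collections import Counter


def shannon_entropy(tokens):
    """Shannon entropy (in bits) of the empirical unigram distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_mixture(domains):
    """Return mixing proportions proportional to each domain's entropy.

    Assumption: higher-entropy domains receive a larger share of the mixture.
    No model is trained; only token statistics are used.
    """
    entropies = {name: shannon_entropy(toks) for name, toks in domains.items()}
    normalizer = sum(entropies.values())
    if normalizer == 0:
        # Every domain is degenerate (zero entropy): fall back to uniform.
        return {name: 1.0 / len(domains) for name in domains}
    return {name: h / normalizer for name, h in entropies.items()}


# Hypothetical toy corpora, already tokenized.
domains = {
    "web": "a b c d a b c d".split(),
    "code": "x x x y".split(),
    "books": "p q p q".split(),
}
proportions = entropy_mixture(domains)
for name, share in proportions.items():
    print(f"{name}: {share:.3f}")
```

The proportions sum to one and can be used directly as sampling weights over domains. In practice the token statistics would come from a sample of each domain under a shared tokenizer.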

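This study proposes a unified scaling law, BiMix, which accurately models the bivariate scaling behavior of data quantity and mixing proportions, and uses low-cost proxy strategies to streamline data curation and improve training efficiency. Experimental evidence shows that entropy-driven, training-free data mixtures can match or even outperform more resource-intensive methods. We hope these quantitative findings can inform further research and development in efficient language modeling.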

Data Mixing Made Efficient: A Bivariate Scaling Law for Language Model Pretraining