Uncovering early-stage metrics that reflect final model performance is one core principle for large-scale pretraining. The existing scaling law demonstrates the power-law correlation between pretraining loss and training flops, which serves as an important indicator of the current training state for large language models. However, this principle only focuses on the model's compression properties on the training data, resulting in an inconsistency with the ability improvements on the downstream tasks. Some follow-up works attempted to extend the scaling-law to more complex metrics (such as hyperparameters), but still lacked a comprehensive analysis of the dynamic differences among various capabilities during pretraining. To address the aforementioned limitations, this paper undertakes a comprehensive comparison of model capabilities at various pretraining intermediate checkpoints. Through this analysis, we confirm that specific downstream metrics exhibit similar training dynamics across models of different sizes, up to 67 billion parameters. In addition to our core findings, we've reproduced Amber and OpenLLaMA, releasing their intermediate checkpoints. This initiative offers valuable resources to the research community and facilitates the verification and exploration of LLM pretraining by open-source researchers. Besides, we provide empirical summaries, including performance comparisons of different models and capabilities, and tuition of key metrics for different training phases. Based on these findings, we provide a more user-friendly strategy for evaluating the optimization state, offering guidance for establishing a stable pretraining process.

通过详细分析不同预训练模型中的不同能力表现，我们确认了特定下游指标在不同大小的模型中展示相似的训练动态，多达670亿参数。此外，我们还复现了Amber和OpenLLaMA，并发布了它们的中间检查点，以为研究界提供宝贵的资源，促进对开源研究人员的LLM预训练进行验证和探索。此外，我们提供了不同模型和能力的性能比较以及不同训练阶段的关键指标指导的实证总结。基于这些发现，我们提供了一种更用户友好的评估优化状态的策略，为建立稳定的预训练流程提供指导。

巧妙之道：利用下游分析能力导航大型语言模型预训练