Kaplan et al. and Hoffmann et al. developed influential scaling laws for the optimal model size as a function of the compute budget, but these laws yield substantially different predictions. We explain the discrepancy by reproducing the Kaplan scaling law on two datasets (OpenWebText2 and RefinedWeb) and identifying three factors causing the difference: last layer computational cost, warmup duration, and scale-dependent optimizer tuning. With these factors corrected, we obtain excellent agreement with the Hoffmann et al. (i.e., "Chinchilla") scaling law. Counter to a hypothesis of Hoffmann et al., we find that careful learning rate decay is not essential for the validity of their scaling law. As a secondary result, we derive scaling laws for the optimal learning rate and batch size, finding that tuning the AdamW $\beta_2$ parameter is essential at lower batch sizes.
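
As a minimal illustration of what a scaling law for the optimal model size looks like in practice, the sketch below fits a power law $N^\star(C) \approx a\,C^{b}$ by ordinary least squares in log-log space. It is not the fitting procedure of either paper: the function name `fit_power_law`, the compute grid, and the synthetic data (generated with an arbitrarily chosen coefficient 0.1 and exponent 0.5) are assumptions made purely for demonstration.

```python
import numpy as np


def fit_power_law(compute, n_opt):
    """Fit n_opt ~= coeff * compute**exponent via least squares on logs.

    Hypothetical helper for illustration; returns (coeff, exponent).
    """
    # A power law is a straight line in log-log space:
    # log(n_opt) = log(coeff) + exponent * log(compute)
    slope, intercept = np.polyfit(np.log(compute), np.log(n_opt), deg=1)
    return float(np.exp(intercept)), float(slope)


# Synthetic, noise-free data (assumed values, not measurements from any paper).
compute = np.logspace(17, 21, num=9)       # compute budgets in FLOPs
n_opt_synthetic = 0.1 * compute ** 0.5     # "optimal" model sizes

coeff, exponent = fit_power_law(compute, n_opt_synthetic)
print(f"fitted exponent: {exponent:.3f}, fitted coefficient: {coeff:.3f}")
```

The fitted exponent $b$ is the quantity on which the two scaling laws disagree, so comparing exponents obtained under different training setups is one simple way to quantify the kind of discrepancy the abstract describes.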

Resolving Discrepancies in Compute-Optimal Scaling of Language Models