Kaplan et al. [2020] (`Kaplan') and Hoffmann et al. [2022] (`Chinchilla')
studied the scaling behavior of transformers trained on next-token language
prediction. These studies produced different estimates for how the number of
parameters ($N$) and training tokens ($D$) should be set to achieve the lowest
possible loss for a given compute budget ($C$): Kaplan reported $N_\text{optimal}
\propto C^{0.73}$, whereas Chinchilla reported $N_\text{optimal} \propto C^{0.50}$. This note
finds that much of this discrepancy can be attributed to Kaplan counting
non-embedding rather than total parameters, combined with their analysis being
performed at small scale. Simulating the Chinchilla study under these
conditions produces biased scaling coefficients close to Kaplan's. Hence, this
note reaffirms Chinchilla's scaling coefficients, by explaining the cause of
Kaplan's original overestimation.
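
The direction of this bias can be illustrated with a toy calculation. The sketch below is not the note's simulation: it assumes, purely for illustration, that total parameters relate to non-embedding parameters as $N_\text{total} = N_\text{non-emb} + \omega N_\text{non-emb}^{1/3}$ and that $N_\text{total} \propto C^{0.50}$ holds exactly. The constant `OMEGA` and the function names are hypothetical. Under these assumptions, the local log-log slope of non-embedding parameters against compute exceeds 0.50 at small scale and approaches 0.50 as models grow.

```python
import math

# Hypothetical constant for illustration only: embedding parameters are
# assumed to grow like OMEGA * n_nonemb ** (1/3).
OMEGA = 5e4


def total_params(n_nonemb: float, omega: float = OMEGA) -> float:
    """Toy relation (an assumption): total = non-embedding + embedding parameters."""
    return n_nonemb + omega * n_nonemb ** (1.0 / 3.0)


def local_exponent(n_nonemb: float, omega: float = OMEGA,
                   true_exponent: float = 0.5, eps: float = 1e-3) -> float:
    """Local slope d log(N_nonemb) / d log(C), assuming N_total is proportional
    to C ** true_exponent. Estimated with a small central finite difference."""
    lo, hi = n_nonemb * (1 - eps), n_nonemb * (1 + eps)
    # From N_total ∝ C^a we get log C = log(N_total) / a + const.
    dlog_c = (math.log(total_params(hi, omega))
              - math.log(total_params(lo, omega))) / true_exponent
    dlog_n = math.log(hi) - math.log(lo)
    return dlog_n / dlog_c


if __name__ == "__main__":
    for n in (1e6, 1e7, 1e8, 1e9, 1e11):
        print(f"N_nonemb = {n:.0e}: local exponent = {local_exponent(n):.3f}")
```

In this toy setting, an exponent fitted only on small models would overestimate the true value, which is the qualitative effect the note describes.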

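This study examines the scaling behavior of transformers on language prediction, explores how parameter counts and compute budget affect model performance, and explains why Kaplan et al.'s estimate was too high.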

Reconciling Kaplan and Chinchilla Scaling Laws

We study the probabilistic modeling performed by Autoregressive Large
Language Models through the angle of time directionality. We empirically find a
time asymmetry exhibited by such models in their ability to model natural
language: a difference in the average log-perplexity when trying to predict the
next token versus when trying to predict the previous one. This difference is
at the same time subtle and very consistent across various modalities
(language, model size, training time, ...). Theoretically, this is surprising:
from an information-theoretic point of view, there should be no such
difference. We provide a theoretical framework to explain how such an asymmetry
can appear from sparsity and computational complexity considerations, and
outline a number of perspectives opened by our results.
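
The information-theoretic argument can be checked on a small example. The sketch below builds an arbitrary exact joint distribution over short sequences (the helper names, vocabulary size, and sequence length are illustrative assumptions) and confirms that the forward factorization, predicting each token from its predecessors, and the backward factorization, predicting each token from its successors, assign the same total log-probability to every sequence. The equality holds only for exact conditionals; a learned model merely approximates them in each direction, which leaves room for the asymmetry reported above.

```python
import itertools
import math
import random


def random_joint(vocab: int, length: int, seed: int = 0) -> dict:
    """Arbitrary strictly positive joint distribution over all sequences."""
    rng = random.Random(seed)
    seqs = list(itertools.product(range(vocab), repeat=length))
    weights = [rng.random() + 1e-3 for _ in seqs]
    total = sum(weights)
    return {s: w / total for s, w in zip(seqs, weights)}


def marginal(joint: dict, positions, values) -> float:
    """Probability that a sequence takes `values` at `positions`."""
    return sum(p for s, p in joint.items()
               if all(s[i] == v for i, v in zip(positions, values)))


def forward_logprob(joint: dict, seq: tuple) -> float:
    """Sum over t of log p(x_t | x_0, ..., x_{t-1})."""
    total = 0.0
    for t in range(len(seq)):
        ctx = list(range(t))
        num = marginal(joint, ctx + [t], seq[:t + 1])
        den = marginal(joint, ctx, seq[:t])
        total += math.log(num / den)
    return total


def backward_logprob(joint: dict, seq: tuple) -> float:
    """Sum over t of log p(x_t | x_{t+1}, ..., x_{n-1})."""
    n = len(seq)
    total = 0.0
    for t in reversed(range(n)):
        ctx = list(range(t + 1, n))
        num = marginal(joint, [t] + ctx, seq[t:])
        den = marginal(joint, ctx, seq[t + 1:])
        total += math.log(num / den)
    return total


if __name__ == "__main__":
    joint = random_joint(vocab=3, length=3)
    gap = max(abs(forward_logprob(joint, s) - backward_logprob(joint, s))
              for s in joint)
    print(f"largest forward/backward gap: {gap:.2e}")
```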

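We study the probabilistic modeling of autoregressive large language models through the lens of time directionality and find empirically that these models exhibit a temporal asymmetry in modeling natural language: the average log-perplexity differs between predicting the next token and predicting the previous one. The difference is subtle yet highly consistent across settings (language, model size, training time, etc.). From an information-theoretic standpoint, no such difference should exist. We provide a theoretical framework explaining how sparsity and computational complexity considerations can give rise to this asymmetry, and outline several new research directions opened by our results.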