Neural scaling laws (NSL) refer to the phenomenon where model performance improves with scale. Sharma & Kaplan analyzed NSL using approximation theory and predicted that MSE losses decay as $N^{-\alpha}$ with $\alpha=4/d$, where $N$ is the number of model parameters and $d$ is the intrinsic input dimension. Although their theory works well for some cases (e.g., ReLU networks), we find, surprisingly, that a simple 1D problem $y=x^2$ manifests a different scaling law ($\alpha=1$) from their prediction ($\alpha=4$). We opened up the neural networks and found that the new scaling law originates from lottery ticket ensembling: a wider network on average has more "lottery tickets", which are ensembled to reduce the variance of outputs. We support the ensembling mechanism by mechanistically interpreting single neural networks, as well as by studying them statistically. We attribute the $N^{-1}$ scaling law to the "central limit theorem" of lottery tickets. Finally, we discuss its potential implications for large language models and statistical physics-type theories of learning.
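
The central-limit-theorem argument can be illustrated with a toy simulation. The sketch below is not the experiment described above: it assumes a hypothetical ticket model in which each "lottery ticket" predicts $y=x^2$ plus an independent zero-mean Gaussian error, and it averages $n$ such tickets. The names `ensemble_mse` and `fitted_exponent`, the noise level, and the grid of inputs are all assumptions made for illustration. Under these assumptions the ensemble MSE should fall roughly as $1/n$, so a log-log fit should give an exponent close to $\alpha=1$.

```python
import numpy as np


def ensemble_mse(n_tickets, n_trials=200, n_points=16, noise_std=0.1, seed=0):
    """MSE of the average of n_tickets noisy predictors of y = x**2.

    Hypothetical ticket model (an assumption, not taken from the source):
    every ticket outputs the target plus independent zero-mean Gaussian noise.
    """
    rng = np.random.default_rng(seed)
    x = np.linspace(-1.0, 1.0, n_points)
    target = x ** 2
    # Independent error per trial, per ticket, and per input point.
    errors = rng.normal(0.0, noise_std, size=(n_trials, n_tickets, n_points))
    predictions = target + errors
    # Ensembling: average the tickets within each trial.
    ensemble = predictions.mean(axis=1)
    return float(np.mean((ensemble - target) ** 2))


def fitted_exponent(ticket_counts):
    """Estimate alpha in MSE ~ n**(-alpha) from a log-log least-squares fit."""
    losses = [ensemble_mse(n) for n in ticket_counts]
    slope, _ = np.polyfit(np.log(ticket_counts), np.log(losses), 1)
    return -slope


if __name__ == "__main__":
    counts = [1, 4, 16, 64, 256]
    for n in counts:
        print(f"n_tickets={n:4d}  mse={ensemble_mse(n):.3e}")
    print(f"fitted alpha = {fitted_exponent(counts):.3f}")
```

The toy model only captures the variance-reduction step: it takes for granted that the number of tickets grows in proportion to network width, which is the part of the claim that requires inspecting trained networks.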

A Neural Scaling Law from Lottery Ticket Ensembling