In recent years, language models have grown drastically in size, and their abilities have been shown to improve with scale. Most recent scaling-law studies have focused on high-compute, high-parameter-count settings, leaving the question of when these abilities begin to emerge largely unanswered. In this paper, we investigate whether the effects of pre-training can be observed when the problem size is reduced by modeling a smaller, reduced-vocabulary language. We show the benefits of pre-training with a masked language modeling (MLM) objective in models as small as 1.25M parameters, and establish a strong correlation between pre-training perplexity and downstream performance on the GLUE benchmark. We examine downscaling effects, extending scaling laws to models as small as ~1M parameters. At this scale, we observe a break in the power law for compute-optimal models and show that the MLM loss does not scale smoothly with compute cost (FLOPs) below $2.2 \times 10^{15}$ FLOPs. We also find that adding layers does not always benefit downstream performance.

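Summary: This paper studies the effect of pre-training in small-scale language models. It finds that pre-training with masked language modeling (MLM) benefits models with as few as 1.25M parameters, and it establishes a strong correlation between pre-training perplexity and downstream performance (GLUE benchmark). It also examines downscaling effects: below $2.2 \times 10^{15}$ FLOPs, the MLM loss does not scale smoothly with compute cost (FLOPs), and adding layers does not always improve downstream performance.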
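To make the "break in the power law" concrete, the following is a minimal sketch of how one might fit a loss-versus-compute power law and inspect where it stops holding. It assumes a functional form `loss ~= a * compute**(-b)`, fitted by least squares in log-log space. The function names `fit_power_law` and `log_residuals` are hypothetical, and the data points are synthetic placeholders, not results from the paper.

```python
import numpy as np


def fit_power_law(compute, loss):
    """Fit loss ~= a * compute**(-b) by linear least squares in log-log space.

    Returns (a, b). The functional form is an assumption for illustration.
    """
    log_c = np.log(np.asarray(compute, dtype=float))
    log_l = np.log(np.asarray(loss, dtype=float))
    # log(loss) = log(a) - b * log(compute), a straight line in log-log space
    slope, intercept = np.polyfit(log_c, log_l, 1)
    return float(np.exp(intercept)), float(-slope)


def log_residuals(compute, loss, a, b):
    """Log-space deviation of each observation from the fitted power law."""
    predicted = np.log(a) - b * np.log(np.asarray(compute, dtype=float))
    return np.log(np.asarray(loss, dtype=float)) - predicted


# Synthetic placeholder data that follows an exact power law (NOT paper data).
compute = np.logspace(14, 18, 9)      # FLOPs, hypothetical range
loss = 50.0 * compute ** -0.05        # hypothetical coefficients

a, b = fit_power_law(compute, loss)
residuals = log_residuals(compute, loss, a, b)
```

On real measurements, one possible check is to fit only the higher-compute points and then look at the residuals of the lower-compute points: residuals that are consistently large or non-monotonic below some compute budget would suggest the power law no longer describes that regime.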

Honey, I Shrunk the Language: Language Model Behavior at Reduced Scale