How does scaling the number of parameters in large language models (LLMs) affect their core capabilities? We study two natural scaling techniques -- weight pruning and simply training a smaller or larger model, which we refer to as dense scaling -- and their effects on two core capabilities of LLMs: (a) recalling facts presented during pre-training and (b) processing information presented in-context during inference. By curating a suite of tasks that help disentangle these two capabilities, we find a striking difference in how these two abilities evolve due to scaling. Reducing the model size by more than 30\% (via either scaling approach) significantly decreases the ability to recall facts seen in pre-training. Yet, a 60--70\% reduction largely preserves the various ways the model can process in-context information, ranging from retrieving answers from a long context to learning parameterized functions from in-context exemplars. The fact that both dense scaling and weight pruning exhibit this behavior suggests that scaling model size has an inherently disparate effect on fact recall and in-context learning.

The Cost of Down-Scaling Language Models: Fact Recall Deteriorates before In-Context Learning