Large language models display remarkable capabilities in logical and mathematical reasoning, allowing them to solve complex tasks. Interestingly, these abilities emerge in networks trained on the simple task of next-token prediction. In this work, we present a theoretical framework for studying auto-regressive next-token predictors. We demonstrate that even simple models such as linear next-token predictors, trained on Chain-of-Thought (CoT) data, can approximate any function efficiently computed by a Turing machine. We introduce a new complexity measure -- length complexity -- which measures the number of intermediate tokens in a CoT sequence required to approximate some target function, and analyze the interplay between length complexity and other notions of complexity. Finally, we show experimentally that simple next-token predictors, such as linear networks and shallow Multi-Layer Perceptrons (MLPs), display non-trivial performance on text generation and arithmetic tasks. Our results demonstrate that the power of language models can be attributed, to a great extent, to the auto-regressive next-token training scheme, and not necessarily to a particular choice of architecture.
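To make the claim about linear next-token predictors concrete, the following is a minimal sketch and not the paper's construction. It assumes that each chain-of-thought token is the output of one NAND gate, which is a linear threshold function of two earlier tokens. The weights are set by hand rather than learned, and the names `XOR_GATES`, `linear_next_token`, and `generate` are hypothetical.

```python
import itertools

# Assumption for illustration: every chain-of-thought token is one NAND gate.
# NAND(a, b) = 1 exactly when 1.5 - a - b > 0, so a linear threshold suffices.
# Positions 0 and 1 hold the input bits; each pair names the two earlier
# positions that the next gate reads. The last gate outputs XOR(x0, x1).
XOR_GATES = [(0, 1), (0, 2), (1, 2), (3, 4)]


def linear_next_token(context, weights, bias):
    """Predict the next bit by thresholding a linear function of the context."""
    score = bias + sum(w * tok for w, tok in zip(weights, context))
    return 1 if score > 0 else 0


def generate(inputs, gates):
    """Auto-regressively append one token per gate to the context."""
    context = list(inputs)
    for i, j in gates:
        # Hand-set weights (not trained): -1 on the two positions the gate reads.
        weights = [0.0] * len(context)
        weights[i] -= 1.0
        weights[j] -= 1.0
        context.append(linear_next_token(context, weights, bias=1.5))
    return context


if __name__ == "__main__":
    for a, b in itertools.product([0, 1], repeat=2):
        sequence = generate((a, b), XOR_GATES)
        print((a, b), "->", sequence[2:], "answer:", sequence[-1])
```

In this toy setting the predictor emits three intermediate tokens before the answer token. XOR of the raw inputs is not linearly separable, so no single linear threshold step could produce the answer directly; the intermediate tokens are what make each individual step linear.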

Auto-Regressive Next-Token Predictors are Universal Learners