Can a mere next-token predictor faithfully model human intelligence? We crystallize this intuitive concern, which is fragmented in the literature. As a starting point, we argue that the two often-conflated phases of next-token prediction -- autoregressive inference and teacher-forced training -- must be treated distinctly. The popular criticism that errors can compound during autoregressive inference crucially assumes that teacher-forcing has learned an accurate next-token predictor. This assumption sidesteps a more deep-rooted problem we expose: in certain classes of tasks, teacher-forcing can simply fail to learn an accurate next-token predictor in the first place. We describe a general mechanism of how teacher-forcing can fail, and design a minimal planning task where both the Transformer and the Mamba architecture empirically fail in that manner -- remarkably, despite the task being straightforward to learn. We provide preliminary evidence that this failure can be resolved when training to predict multiple tokens in advance. We hope this finding can ground future debates and inspire explorations beyond the next-token prediction paradigm. We make our code available at https://github.com/gregorbachmann/Next-Token-Failures
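
To make the distinction between the two phases concrete, here is a minimal Python sketch. It is not taken from the linked repository: `toy_model`, `teacher_forced_predictions`, and `autoregressive_generate` are hypothetical names introduced for illustration, and the "model" is a hand-written function rather than a trained network. It only illustrates how the two phases condition on different prefixes, and how a single mistake can compound at inference time; it does not reproduce the teacher-forcing failure mechanism or the planning task described above.

```python
from typing import Callable, List

Token = int
# Hypothetical stand-in for a trained network: maps a prefix to one next token.
Model = Callable[[List[Token]], Token]


def teacher_forced_predictions(
    model: Model, prompt: List[Token], target: List[Token]
) -> List[Token]:
    """Predict each target token from the prompt plus the TRUE earlier target tokens."""
    return [model(prompt + target[:i]) for i in range(len(target))]


def autoregressive_generate(
    model: Model, prompt: List[Token], length: int
) -> List[Token]:
    """Predict each token from the prompt plus the model's OWN earlier outputs."""
    generated: List[Token] = []
    for _ in range(length):
        generated.append(model(prompt + generated))
    return generated


def toy_model(prefix: List[Token]) -> Token:
    """Assumed toy predictor: counts upward by one, but skips a number after seeing 3."""
    last = prefix[-1]
    return last + 2 if last == 3 else last + 1


prompt = [1]
target = [2, 3, 4, 5, 6]

tf = teacher_forced_predictions(toy_model, prompt, target)
ar = autoregressive_generate(toy_model, prompt, len(target))

# Count positions where each output disagrees with the target sequence.
tf_errors = sum(p != t for p, t in zip(tf, target))
ar_errors = sum(p != t for p, t in zip(ar, target))

print("teacher-forced:", tf, "errors:", tf_errors)
print("autoregressive:", ar, "errors:", ar_errors)
```

In this toy setting the same predictor makes one mistake when every prefix is the ground truth, but three when it must condition on its own earlier output, because the wrong token stays in the prefix for every later step.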

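By treating the two key phases of next-token prediction, autoregressive inference and teacher-forced training, separately, the study shows that in certain classes of tasks the problem is not only that errors can compound during autoregressive inference: teacher-forcing can fail to learn an accurate next-token predictor in the first place. The study demonstrates this failure empirically and provides preliminary evidence that training to predict multiple tokens in advance can resolve it. The authors hope this finding will prompt discussion and exploration beyond the next-token prediction paradigm.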

The Pitfalls of Next-Token Prediction