Next-token prediction models have predominantly relied on decoder-only Transformers with causal attention, driven by the common belief that causal attention is essential to prevent "cheating" by masking future tokens. We challenge this widely accepted notion and argue that this design choice is about efficiency rather than necessity. While decoder-only Transformers are still a good choice for practical reasons, they are not the only viable option. In this work, we introduce Encoder-only Next Token Prediction (ENTP). We explore the differences between ENTP and decoder-only Transformers in expressive power and complexity, highlighting potential advantages of ENTP. We introduce the Triplet-Counting task and show, both theoretically and experimentally, that while ENTP can perform this task easily, a decoder-only Transformer cannot. Finally, we empirically demonstrate ENTP's superior performance across various realistic tasks, such as length generalization and in-context learning.

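This work questions the design choice of using decoder-only Transformers, the prevailing architecture for next-token prediction, and argues that the choice is driven mainly by efficiency rather than necessity. It introduces Encoder-only Next Token Prediction (ENTP), finds that ENTP has potential advantages in expressive power and complexity, and reports that it outperforms conventional decoder-only models on practical tasks.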
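To make the contrast between the two designs concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy single-head attention layer with identity projections, no learned weights, no residual connections, and no output head. The names `attention_layer`, `decoder_states`, `entp_states`, and `max_gap` are hypothetical and chosen only for illustration. The decoder-only variant applies a causal mask and produces a state for every prefix in one pass, while the encoder-only variant reruns an unmasked stack on each prefix and keeps only the last state, so neither variant ever sees future tokens.

```python
import math

def attention_layer(xs, causal):
    """Toy single-head self-attention with identity projections (an assumption)."""
    n = len(xs)
    out = []
    for i in range(n):
        # Causal: position i attends to positions 0..i.
        # Unmasked: position i attends to every position in xs.
        visible = range(i + 1) if causal else range(n)
        scores = [sum(a * b for a, b in zip(xs[i], xs[j])) for j in visible]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * xs[j][d] for w, j in zip(weights, visible)) / total
            for d in range(len(xs[0]))
        ])
    return out

def decoder_states(xs, num_layers):
    """Decoder-only: one causal pass yields a state for every prefix."""
    h = xs
    for _ in range(num_layers):
        h = attention_layer(h, causal=True)
    return h

def entp_states(xs, num_layers):
    """Encoder-only: rerun an unmasked stack on each prefix, keep the last state."""
    states = []
    for i in range(len(xs)):
        h = xs[: i + 1]  # the prefix contains no future tokens
        for _ in range(num_layers):
            h = attention_layer(h, causal=False)
        states.append(h[-1])
    return states

def max_gap(a, b):
    """Largest absolute elementwise difference between two lists of vectors."""
    return max(abs(p - q) for u, v in zip(a, b) for p, q in zip(u, v))

# Arbitrary toy embeddings, chosen only for illustration.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0]]
gap_one_layer = max_gap(decoder_states(tokens, 1), entp_states(tokens, 1))
gap_two_layers = max_gap(decoder_states(tokens, 2), entp_states(tokens, 2))
print(gap_one_layer, gap_two_layers)
```

In this toy setting the two variants agree for a single layer and diverge once a second layer is stacked, because the encoder-only stack lets earlier positions in a prefix be recomputed with knowledge of later ones. The per-prefix rerun in `entp_states` also shows where the extra cost comes from: the causal stack can reuse earlier states, whereas the unmasked stack cannot.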

ENTP: Encoder-only Next Token Prediction