Multi-token prediction has emerged as a promising objective for improving language model pretraining, but its benefits have not consistently generalized to other settings such as fine-tuning. In this paper, we propose MuToR, a simple and effective approach to multi-token prediction that interleaves learnable register tokens into the input sequence, each tasked with predicting future targets. Compared to existing methods, MuToR offers several key advantages: it introduces only a negligible number of additional parameters, requires no architectural changes (ensuring compatibility with off-the-shelf pretrained language models), and remains aligned with the next-token pretraining objective, making it especially well-suited for supervised fine-tuning. Moreover, it naturally supports scalable prediction horizons. We demonstrate the effectiveness and versatility of MuToR across a range of use cases, including supervised fine-tuning, parameter-efficient fine-tuning (PEFT), and pretraining, on challenging generative tasks in both language and vision domains. Our code will be available at: https://github.com/nasosger/MuToR.
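
To make the interleaving idea concrete, the following is a minimal sketch of how register tokens might be placed into a sequence and paired with future targets. It is an illustration under stated assumptions, not the released implementation: the `<reg>` placeholder, the `interleave_registers` helper, and the `offset` and `stride` parameters are all hypothetical, and the sketch does not cover attention masking, position handling, or the loss.

```python
from typing import List, Optional, Tuple

REGISTER = "<reg>"  # hypothetical placeholder for a learnable register token


def interleave_registers(
    tokens: List[str], offset: int = 2, stride: int = 1
) -> Tuple[List[str], List[Optional[str]]]:
    """Interleave register tokens and build per-position targets.

    Assumptions (not taken from the paper):
      * a register is inserted after every `stride` regular tokens;
      * a regular token at index i keeps the usual next-token target tokens[i + 1];
      * a register inserted after index i targets tokens[i + offset];
      * positions with no available target get None (to be ignored by the loss).
    """
    inputs: List[str] = []
    targets: List[Optional[str]] = []
    n = len(tokens)
    for i, tok in enumerate(tokens):
        # Regular token: standard next-token prediction target.
        inputs.append(tok)
        targets.append(tokens[i + 1] if i + 1 < n else None)
        # Register token: predicts a target `offset` steps ahead.
        if (i + 1) % stride == 0:
            inputs.append(REGISTER)
            targets.append(tokens[i + offset] if i + offset < n else None)
    return inputs, targets


if __name__ == "__main__":
    seq, tgt = interleave_registers(["a", "b", "c", "d"], offset=2)
    for s, t in zip(seq, tgt):
        print(f"{s!r:8} -> {t!r}")
```

In this sketch the regular tokens keep their ordinary next-token targets, which is one way to read the claim that the approach stays aligned with the next-token objective; changing `offset` varies the prediction horizon without touching the model itself.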

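This work addresses the problem that the benefits of multi-token prediction in language model pretraining have not generalized to other settings such as fine-tuning. The proposed MuToR method interleaves learnable register tokens into the input sequence so that they predict future targets. The study shows that MuToR performs well across a range of use cases, is especially well-suited to supervised fine-tuning, and remains aligned with the standard next-token pretraining objective.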

Multi-Token Prediction Needs Registers