While language models have exceptional capabilities at text generation, they lack a natural inductive bias for emitting numbers and thus struggle in tasks involving reasoning over quantities, especially arithmetics. This has particular relevance in scientific datasets where combinations of text and numerical data are abundant. One fundamental limitation is the nature of the CE loss, which assumes a nominal (categorical) scale and thus cannot convey proximity between generated number tokens. As a remedy, we here present two versions of a number token loss. The first is based on an $L_p$ loss between the ground truth token value and the weighted sum of the predicted class probabilities. The second loss minimizes the Wasserstein-1 distance between the distribution of the predicted output probabilities and the ground truth distribution. These regression-like losses can easily be added to any language model and extend the CE objective during training. We compare the proposed schemes on a mathematics dataset against existing tokenization, encoding, and decoding schemes for improving number representation in language models. Our results reveal a significant improvement in numerical accuracy when equipping a standard T5 model with the proposed loss schemes.

本研究解决了语言模型在数字生成和数量推理（特别是算术）方面的不足。我们提出了两种数字 token 损失函数，以克服传统交叉熵损失的局限性，这些损失函数通过度量生成的数字 token 与真实值之间的距离，显著提高了模型的数字准确性，尤其是在标准 T5 模型上表现突出。

回归，而非猜测——一种针对语言模型数字 token 的类回归损失