Understanding and Mitigating Tokenization Bias in Language Models

June 2024

Buu Phan, Marton Havasi, Matthew Muckley, Karen Ullrich

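TL;DR: We propose a novel algorithm that obtains unbiased estimates from tokenized data without fine-tuning the model. In a Markov chain setting, we accurately recover the transition probabilities from a tokenized language model.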

Abstract

State-of-the-art language models are autoregressive and operate on subword units known as tokens. Specifically, one must encode the conditioning string into a list of tokens before passing it to the language model