Jun, 2024
Understanding and Mitigating Tokenization Bias in Language Models
Buu Phan, Marton Havasi, Matthew Muckley, Karen Ullrich
TL;DR
We propose a novel algorithm that obtains unbiased estimates from tokenized data without requiring any modification to the model. In a Markov chain setting, we exactly recover the transition probabilities from a tokenized language model.
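The bias referred to above can be illustrated with a toy example (this is an illustrative sketch, not the paper's algorithm): with a vocabulary containing a merged token `ab` and greedy longest-match encoding, the single-character token `a` is only ever emitted when the next character is not `b`, so naive token-level counting estimates P(b | a) as zero even when the true transition probability is 0.5.

```python
import random

random.seed(0)

# Toy first-order Markov chain over characters 'a' and 'b'.
# (Illustrative transition probabilities, not taken from the paper.)
P = {'a': {'a': 0.5, 'b': 0.5}, 'b': {'a': 0.5, 'b': 0.5}}

def sample_chain(n, start='a'):
    s, cur = [start], start
    for _ in range(n - 1):
        cur = random.choices(list(P[cur]), weights=list(P[cur].values()))[0]
        s.append(cur)
    return ''.join(s)

def tokenize(s):
    # Greedy longest-match with vocabulary {'ab', 'a', 'b'},
    # mimicking a single BPE merge a + b -> 'ab'.
    toks, i = [], 0
    while i < len(s):
        if s[i:i + 2] == 'ab':
            toks.append('ab'); i += 2
        else:
            toks.append(s[i]); i += 1
    return toks

text = sample_chain(100_000)
toks = tokenize(text)

# Naive token-level estimate of P(next char = 'b' | current char = 'a'):
# how often is token 'a' immediately followed by a token starting with 'b'?
pairs = [(t1, t2) for t1, t2 in zip(toks, toks[1:]) if t1 == 'a']
naive = sum(t2.startswith('b') for _, t2 in pairs) / len(pairs)

# Character-level (unbiased) estimate from the raw string.
after_a = [text[i + 1] for i in range(len(text) - 1) if text[i] == 'a']
true_est = after_a.count('b') / len(after_a)

print(naive, true_est)  # naive is 0.0; true_est is close to 0.5
```

Because the tokenizer always merges `a` followed by `b` into `ab`, the token bigram (`a`, `b…`) never occurs, which is precisely the kind of systematic bias that correction methods must account for.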
Abstract
State-of-the-art language models are autoregressive and operate on subword units known as tokens. Specifically, one must encode the conditioning string into a list of tokens before passing it to the language models