Jun, 2024
Tokenization Falling Short: The Curse of Tokenization
Yekun Chai, Yewei Fang, Qiwei Peng, Xuhong Li
TL;DR
Large language models suffer from tokenization problems, leaving them sensitive to typographical errors and length variations and oblivious to the internal structure of tokens. This study systematically investigates these challenges and their impact on LLMs by probing complex problem solving, token-structure understanding, and resilience to typographical errors, and shows that scaling model parameters and applying subword regularization help mitigate these issues.
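
As a rough illustration of the subword regularization idea mentioned above (in the spirit of BPE-dropout; the merge table and helper below are hypothetical, not the paper's actual setup), here is a minimal Python sketch:

import random

# Toy BPE merge table in rank order (hypothetical, for illustration only;
# real tokenizers learn tens of thousands of merges from a corpus).
MERGES = [("i", "n"), ("in", "g"), ("t", "h"), ("th", "e"), ("e", "r")]

def bpe_encode(word, merges, dropout=0.0, rng=random):
    # Greedy BPE segmentation; with dropout > 0 each merge is randomly
    # skipped (BPE-dropout style), yielding varied segmentations of a word.
    symbols = list(word)
    for a, b in merges:  # apply merges in rank order
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b and rng.random() >= dropout:
                symbols[i:i + 2] = [a + b]  # merge the adjacent pair
            else:
                i += 1
    return symbols

print(bpe_encode("thinking", MERGES))               # deterministic: ['th', 'in', 'k', 'ing']
print(bpe_encode("thinking", MERGES, dropout=0.5))  # regularized: varies per call

Exposing the model to several plausible segmentations of the same word during training is what makes it less brittle to any single (possibly typo-perturbed) segmentation at test time.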
Abstract
Language models typically tokenize raw text into sequences of subword identifiers from a predefined vocabulary, a process inherently sensitive to typographical errors and length variations, and largely oblivious to the internal structure of tokens.
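
To make the typo sensitivity concrete, a small probe using a public BPE tokenizer (GPT-2 via Hugging Face transformers, chosen here as an arbitrary stand-in, not one of the models evaluated in the paper) shows how a couple of misplaced characters change the whole subword sequence:

from transformers import AutoTokenizer

# Any BPE-style tokenizer illustrates the point; GPT-2's is an arbitrary choice.
tok = AutoTokenizer.from_pretrained("gpt2")

for text in ["The weather is beautiful today.",
             "The weathr is beautifull today."]:  # two small typos
    print(text, "->", tok.tokenize(text))

# The misspelled words are typically split into different (and more) subword
# pieces, so the model receives a substantially different input sequence even
# though a human reader barely notices the change.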