BriefGPT.xyz
Jun, 2024
Unsupervised Morphological Tree Tokenizer
Qingyang Zhu, Xiang Hu, Pengyu Ji, Wei Wu, Kewei Tu
TL;DR
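This paper introduces morphological structure guidance into tokenization and proposes a deep model that induces the character-level structure of words. The method performs well on both morphological segmentation and language modeling tasks, and it outperforms widely adopted methods such as BPE and WordPiece.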
Abstract
As a cornerstone in language modeling, tokenization involves segmenting text inputs into pre-defined atomic units. Conventional statistical tokenizers often disrupt constituent boundaries within words, thereby co