BriefGPT.xyz
Oct, 2023
Analyzing Cognitive Plausibility of Subword Tokenization
Lisa Beinborn, Yuval Pinter
TL;DR
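The paper compares three tokenization algorithms across several languages and vocabulary sizes, and finds that the UnigramLM algorithm yields less cognitively plausible tokenization behavior and lower coverage of derivational morphology.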
Abstract
Subword tokenization has become the de-facto standard for tokenization, although comparative evaluations of subword vocabulary quality across languages are scarce. Existing evaluation studies focus on the effect