Analyzing the Cognitive Plausibility of Subword Tokenization

Oct, 2023

Analyzing Cognitive Plausibility of Subword Tokenization

Lisa Beinborn, Yuval Pinter

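TL;DR: Compares three tokenization algorithms across several languages and vocabulary sizes, and finds that the UnigramLM algorithm yields less cognitively plausible tokenization behavior and lower coverage of derivational morphemes.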

Abstract

Subword tokenization has become the de facto standard for tokenization, although comparative evaluations of subword vocabulary quality across languages are scarce. Existing evaluation studies focus on the effect