BriefGPT.xyz
Mar, 2024
Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance
Omer Goldman, Avi Caciularu, Matan Eyal, Kris Cao, Idan Szpektor...
TL;DR
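By varying the amount of training data, we study how the compression ability of BPE tokenizers affects the downstream performance of pretrained language models. We find that compression ability correlates with model performance, so building tokenizers that compress better is a promising research direction.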
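To make "compression ability" concrete, here is a minimal sketch of a toy BPE trainer and one possible compression measure (average characters per token). It is an illustration only, not the paper's setup: the function names `train_bpe`, `apply_merge`, `encode`, and `chars_per_token` are hypothetical, whitespace pre-splitting is a simplification, and the choice of characters per token as the metric is an assumption.

```python
from collections import Counter


def apply_merge(symbols: list[str], pair: tuple[str, str]) -> list[str]:
    """Replace every adjacent occurrence of `pair` with the concatenated symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def train_bpe(corpus: str, num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` merges from whitespace-split words (toy version)."""
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Re-segment every word with the newly learned merge.
        updated = Counter()
        for symbols, freq in words.items():
            updated[tuple(apply_merge(list(symbols), best))] += freq
        words = updated
    return merges


def encode(text: str, merges: list[tuple[str, str]]) -> list[str]:
    """Tokenize `text` by applying the learned merges in training order."""
    tokens = []
    for word in text.split():
        symbols = list(word)
        for pair in merges:
            symbols = apply_merge(symbols, pair)
        tokens.extend(symbols)
    return tokens


def chars_per_token(text: str, merges: list[tuple[str, str]]) -> float:
    """Average characters per token; higher means better compression."""
    tokens = encode(text, merges)
    if not tokens:
        raise ValueError("text contains no tokens")
    return sum(len(t) for t in tokens) / len(tokens)
```

Training `train_bpe` on progressively larger slices of a corpus while keeping `num_merges` fixed, then comparing `chars_per_token` on held-out text, mirrors the idea of varying tokenizer training data in a very reduced form.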
Abstract
Despite it being the cornerstone of BPE, the most common tokenization algorithm, the importance of compression in the tokenization process ...