BriefGPT.xyz
Apr, 2022
Impact of Tokenization on Language Models: An Analysis for Turkish
Cagri Toraman, Eyup Halit Yilmaz, Furkan Şahinuç, Oguzhan Ozcelik
TL;DR
This paper compares tokenizers at different granularity levels, and the pretrained language models built on them, using Turkish data from the OSCAR corpus. It finds that a custom morphological-level tokenizer delivers performance competitive with the de facto methods, and that increasing the vocabulary size improves the performance of both the custom morphological-level tokenizer and the medium-sized language models pretrained with RoBERTa.
Abstract
Tokenization is an important text preprocessing step to prepare input tokens for deep language models. WordPiece and BPE are de facto methods employed by important models, such as BERT and GPT. However, the impac…
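BPE, one of the de facto methods named above, builds its vocabulary by repeatedly merging the most frequent adjacent symbol pair in the corpus. A minimal sketch of that merge loop follows; the toy Turkish word counts and the number of merges are invented for illustration and are not taken from the paper:

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge one symbol pair into a single symbol in every word.

    Plain str.replace is a simplification; a robust implementation
    matches only on whole-symbol boundaries.
    """
    old, new = " ".join(pair), "".join(pair)
    return {word.replace(old, new): freq for word, freq in vocab.items()}

# Toy Turkish word counts, written as space-separated characters with an
# end-of-word marker (hypothetical example data, not from the paper).
vocab = {"e v </w>": 5, "e v l e r </w>": 3, "e v d e </w>": 2}

merges = []
for _ in range(3):  # learn three merge rules
    counts = get_pair_counts(vocab)
    best = max(counts, key=counts.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    merges.append(best)
print(merges)  # first learned merge is ('e', 'v')
```

The learned merge rules are then applied in order to segment unseen words, which is how subword tokenizers keep the vocabulary small while still covering rare words.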