BriefGPT.xyz
Jun, 2023
Assessing the Importance of Frequency versus Compositionality for Subword-based Tokenization in NMT
Benoist Wolleb, Romain Silvestri, Giorgos Vernikos, Ljiljana Dolamic, Andrei Popescu-Belis
TL;DR
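This study examines subword tokenization in neural language models and machine translation systems, and proposes a tokenization method based on Huffman coding. The results indicate that giving very frequent words their own separate tokens is a relatively important factor in reaching scores higher than those of the greedy algorithm.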
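To make the idea of Huffman coding over words concrete, here is a minimal sketch of textbook Huffman coding applied to word frequencies. It is not the authors' implementation: the binary code alphabet, the function name `huffman_word_codes`, and the toy corpus are all assumptions made for illustration.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_word_codes(words):
    """Map each distinct word to a binary Huffman code (hypothetical helper).

    More frequent words never receive longer codes than less frequent ones.
    """
    freqs = Counter(words)
    if not freqs:
        return {}
    if len(freqs) == 1:
        # A single distinct word still needs a non-empty code.
        return {next(iter(freqs)): "0"}

    tie = count()  # tie-breaker so heapq never has to compare dicts
    heap = [(f, next(tie), {w: ""}) for w, f in freqs.items()]
    heapq.heapify(heap)

    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codes.
        f0, _, left = heapq.heappop(heap)
        f1, _, right = heapq.heappop(heap)
        merged = {w: "0" + c for w, c in left.items()}
        merged.update({w: "1" + c for w, c in right.items()})
        heapq.heappush(heap, (f0 + f1, next(tie), merged))

    return heap[0][2]


# Toy corpus (assumed for illustration only).
corpus = "the cat sat on the mat the cat".split()
codes = huffman_word_codes(corpus)
encoded = [codes[w] for w in corpus]
print(codes)
print(encoded)
```

In this sketch each symbol of a word's code would play the role of a token, so frequent words get short encodings while the codes carry no compositional information about the word's spelling.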
Abstract
Subword tokenization is the de facto standard for tokenization in neural language models and machine translation systems. Three advantages ...