Jun, 2021

Modeling the Unigram Distribution

Irene Nikkarinen, Tiago Pimentel, Damián E. Blasi, Ryan Cotterell

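TL;DR: This paper discusses how to correctly model the frequency distribution of word forms in a corpus. It introduces a neural-network-based model that better estimates the probability of a word occurring. Experiments on corpora in seven languages show that the model performs well and outperforms traditional methods.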

Abstract

The unigram distribution is the non-contextual probability of finding a specific word form in a corpus. While of central importance to the study of language, it is commonly approximated by each word's sample frequency in the corpus. This approach, being highly dependent on sample size,