BriefGPT.xyz
Apr, 2020
Unsupervised Domain Clusters in Pretrained Language Models
Roee Aharoni, Yoav Goldberg
TL;DR
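This paper proposes a domain data selection method based on large-scale pretrained language models. It clusters sentences by measuring their implicit similarity, and needs only a small amount of data to effectively improve the accuracy of neural machine translation.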
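As a rough illustration of cluster-based data selection, the sketch below groups toy sentence vectors and keeps the candidates that land in the same cluster as a small in-domain seed. Everything here is an assumption made for illustration: the 2-D vectors are hand-made stand-ins for pretrained language model representations, plain k-means is used only as a simple stand-in clustering step, and the helper names (`kmeans`, `nearest`) are hypothetical. It is not the authors' implementation.

```python
import math

def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda k: dist(point, centroids[k]))

def kmeans(points, init_centroids, iters=10):
    """Plain k-means with caller-supplied initial centroids (deterministic)."""
    centroids = [list(c) for c in init_centroids]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [nearest(p, centroids) for p in points]
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = mean(members)
    return labels, centroids

# Toy "embeddings" (assumed values; real ones would come from a pretrained LM).
candidates = {
    "The patient was given 5 mg daily.": [0.9, 0.1],
    "Dosage should be reduced in renal failure.": [0.8, 0.2],
    "The court dismissed the appeal.": [0.1, 0.9],
    "The defendant signed the contract.": [0.2, 0.8],
}
# A small in-domain seed (here: one medical-looking vector).
seed = [[0.85, 0.15]]

sentences = list(candidates)
points = [candidates[s] for s in sentences]
labels, centroids = kmeans(points, init_centroids=[points[0], points[2]])

# Select every candidate that shares a cluster with the seed.
seed_cluster = nearest(mean(seed), centroids)
selected = [s for s, lab in zip(sentences, labels) if lab == seed_cluster]
print(selected)
```

The selected sentences would then serve as additional in-domain training data. In a realistic setting the clustering method, the number of clusters, and the embedding model would all need to be chosen and validated for the task.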
Abstract
The notion of "in-domain data" in NLP is often over-simplistic and vague, as textual data varies in many nuanced linguistic aspects such as topic, style or level of formality. In addition, domain labels are many ...