BriefGPT.xyz
May, 2023
Where's the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation
Benjamin Minixhofer, Jonas Pfeiffer, Ivan Vulić
TL;DR
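This paper proposes a multilingual, punctuation-agnostic, self-supervised method for sentence segmentation. It uses the newline characters in unsegmented text as the segmentation signal, and it can be adapted to the segmentation of a different corpus with only a small number of annotated examples. By matching the sentence segmentation to the one used when training the MT model, the authors obtain significant improvements in BLEU score and machine translation quality.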
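To make the newline-based supervision concrete, here is a minimal sketch of how boundary labels could be derived from raw newline-delimited text. It is an illustration under stated assumptions, not the authors' implementation: the function names `newline_labels` and `split_by_labels`, the character-level labeling, and the single-space `joiner` are all hypothetical choices made for this example.

```python
from typing import List, Tuple


def newline_labels(raw_text: str, joiner: str = " ") -> Tuple[str, List[int]]:
    """Build (flat_text, labels) from newline-delimited text.

    labels[i] is 1 if character i of flat_text was the last character
    before a newline (or the end of the text) in raw_text, else 0.
    The labeling scheme is an assumption made for this sketch.
    """
    # Split on newlines and drop empty segments (blank lines).
    segments = [s.strip() for s in raw_text.split("\n")]
    segments = [s for s in segments if s]

    chars: List[str] = []
    labels: List[int] = []
    for idx, segment in enumerate(segments):
        if idx > 0:
            # Re-join segments without the newline; joiner characters
            # are never boundaries.
            for ch in joiner:
                chars.append(ch)
                labels.append(0)
        chars.extend(segment)
        labels.extend([0] * len(segment))
        labels[-1] = 1  # the newline followed this character
    return "".join(chars), labels


def split_by_labels(flat_text: str, labels: List[int]) -> List[str]:
    """Cut flat_text after every position whose label is 1."""
    pieces: List[str] = []
    start = 0
    for i, label in enumerate(labels):
        if label == 1:
            pieces.append(flat_text[start:i + 1].strip())
            start = i + 1
    if start < len(flat_text):
        pieces.append(flat_text[start:].strip())
    return pieces


raw = "Hello world\nthis has no punctuation\nThird line"
flat, labels = newline_labels(raw)
segments = split_by_labels(flat, labels)
```

In this sketch a model would see only `flat` and be trained to predict `labels`, so no punctuation is needed to define a boundary. `split_by_labels` shows how predicted labels, once thresholded to 0 or 1, map back to segments.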
Abstract
Many NLP pipelines split text into sentences as one of the crucial preprocessing steps. Prior sentence segmentation tools either rely on punctuation or require a considerable amount of sentence-segmented training ...