Mining Word Boundaries in Speech as Naturally Annotated Word Segmentation Data

Oct, 2022

Lei Zhang, Shilin Zhou, Chen Gong, Zhenghua Li, Zhefeng Wang...

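TL;DR: This work proposes a method for improving Chinese word segmentation (CWS) in cross-domain and low-resource settings: naturally annotated data is mined from pauses in speech and used to train CWS models, and the authors show that this significantly improves CWS performance.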
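
To make the idea of pause-derived boundaries concrete, here is a minimal sketch, not the authors' pipeline. It assumes character-level timestamps are already available (for example from a forced aligner) and treats any silence longer than a threshold as a word boundary. The input format, the function names, and the 0.2-second threshold are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

# Hypothetical input format: one (character, start_sec, end_sec) triple per
# character, such as a forced aligner might produce. Not taken from the paper.
CharSpan = Tuple[str, float, float]


def pause_boundaries(chars: List[CharSpan], min_pause: float = 0.2) -> List[int]:
    """Return every index i where the silence between chars[i] and chars[i + 1]
    is at least min_pause seconds. The threshold is an assumed value."""
    boundaries = []
    for i in range(len(chars) - 1):
        gap = chars[i + 1][1] - chars[i][2]  # next start minus current end
        if gap >= min_pause:
            boundaries.append(i)
    return boundaries


def split_on_pauses(chars: List[CharSpan], min_pause: float = 0.2) -> List[str]:
    """Cut the character sequence into chunks at the detected pauses."""
    cuts = set(pause_boundaries(chars, min_pause))
    chunks, current = [], ""
    for i, (ch, _, _) in enumerate(chars):
        current += ch
        if i in cuts:
            chunks.append(current)
            current = ""
    if current:
        chunks.append(current)
    return chunks


# Made-up alignment for illustration: a 0.30 s silence follows the second character.
aligned = [
    ("我", 0.00, 0.15), ("们", 0.15, 0.30),
    ("喜", 0.60, 0.75), ("欢", 0.75, 0.90),
    ("音", 0.92, 1.05), ("乐", 1.05, 1.20),
]
print(split_on_pauses(aligned))
```

On this made-up input the sketch produces two chunks, 我们 and 喜欢音乐. The second chunk still contains two words, which shows that pauses yield only partial annotation: a pause marks a boundary, but the absence of a pause says nothing about where words end inside a chunk.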

Abstract

Chinese word segmentation (CWS) models have achieved very high performance when the training data is sufficient and in-domain. However, the performance drops drastically when shifting to cross-domain and low-resource sc