In this paper, we approach Vietnamese word segmentation as a binary classification problem using a Support Vector Machine classifier. We inherit features from prior work, such as syllable n-grams, syllable-type n-grams, and dictionary lookups of the conjunction of adjacent syllables. We propose two novel feature extraction methods, one to reduce overlap ambiguity and the other to improve the prediction of unknown words containing suffixes. Unlike UETsegmenter and RDRsegmenter, two state-of-the-art Vietnamese word segmentation methods, we do not employ the longest matching algorithm as an initial processing step, nor any post-processing technique. In experiments on benchmark Vietnamese datasets, our proposed method obtained a better F1-score than the prior state-of-the-art methods UETsegmenter and RDRsegmenter.
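To make the binary classification framing concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that each gap between two adjacent syllables receives one label (1 if the two syllables belong to the same word, 0 if a word boundary falls between them) and that underscores join the syllables of a word. The function names, the syllable-type scheme, the feature templates, and the toy dictionary are all hypothetical; the two novel feature extraction methods are not reproduced here.

```python
# Illustrative sketch only: names, syllable types, and feature templates
# are assumptions, not the paper's exact feature set.

def syllable_type(s):
    """Coarse syllable type (a hypothetical four-way scheme plus OTHER)."""
    if s.isdigit():
        return "NUM"
    if s.isupper():
        return "UPPER"
    if s[:1].isupper():
        return "CAP"
    if s.isalpha():
        return "LOWER"
    return "OTHER"


def to_boundary_labels(words):
    """Split segmented words into syllables and emit one label per gap.

    Label 1: the gap joins two syllables of the same word.
    Label 0: the gap is a word boundary.
    """
    syllables, labels = [], []
    for word in words:
        for i, part in enumerate(word.split("_")):
            if syllables:  # every syllable except the first closes a gap
                labels.append(1 if i > 0 else 0)
            syllables.append(part)
    return syllables, labels


def from_boundary_labels(syllables, labels):
    """Rebuild words from syllables and per-gap labels."""
    if not syllables:
        return []
    words = [syllables[0]]
    for syllable, label in zip(syllables[1:], labels):
        if label == 1:
            words[-1] += "_" + syllable
        else:
            words.append(syllable)
    return words


def gap_features(syllables, i, dictionary):
    """Features for the gap between syllables[i] and syllables[i + 1]."""
    padded = ["<s>"] + syllables + ["</s>"]
    prev, left, right, nxt = padded[i], padded[i + 1], padded[i + 2], padded[i + 3]
    return {
        "left": left.lower(),
        "right": right.lower(),
        "bigram": f"{left} {right}".lower(),
        "trigram_left": f"{prev} {left} {right}".lower(),
        "trigram_right": f"{left} {right} {nxt}".lower(),
        "type_bigram": f"{syllable_type(left)}+{syllable_type(right)}",
        # Dictionary check on the conjunction of the two adjacent syllables.
        "in_dict": f"{left} {right}".lower() in dictionary,
    }


if __name__ == "__main__":
    toy_dictionary = {"học sinh", "sinh học"}  # hypothetical entries
    syllables, labels = to_boundary_labels(["học_sinh", "học", "sinh_học"])
    for i, label in enumerate(labels):
        print(label, gap_features(syllables, i, toy_dictionary))
```

Under these assumptions, each feature dictionary would be vectorized and passed, together with its label, to an SVM trainer; predicted labels can then be turned back into words with `from_boundary_labels`. Note that in the toy sentence every gap passes the dictionary check, which is the kind of overlap ambiguity that the dictionary feature alone cannot resolve.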

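This Vietnamese word segmentation method uses a Support Vector Machine classifier and inherits features from prior work, including syllable n-grams, syllable-type n-grams, and dictionary checks on the conjunction of adjacent syllables. It proposes two new feature extraction methods, one to reduce overlap ambiguity and the other to improve the prediction of unknown words containing suffixes. On benchmark Vietnamese datasets, the proposed method obtains a better F1-score than the prior state-of-the-art methods UETsegmenter and RDRsegmenter.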

SVM-based Vietnamese word segmentation: reducing ambiguity and capturing suffixes