Paper Title
Vietnamese Word Segmentation with SVM: Ambiguity Reduction and Suffix Capture
Paper Authors
Paper Abstract
In this paper, we approach Vietnamese word segmentation as a binary classification problem, using a Support Vector Machine (SVM) classifier. We inherit features from prior work, such as n-grams of syllables, n-grams of syllable types, and checks on whether combinations of adjacent syllables appear in the dictionary. We propose two novel feature extraction methods: one reduces overlap ambiguity, and the other improves the prediction of unknown words that contain suffixes. Unlike UETsegmenter and RDRsegmenter, two state-of-the-art Vietnamese word segmentation methods, we do not employ the longest matching algorithm as an initial processing step, nor do we use any post-processing technique. According to experimental results on benchmark Vietnamese datasets, our proposed method achieves a higher F1-score than the prior state-of-the-art methods UETsegmenter and RDRsegmenter.