Using Punkt for Sentence Segmentation in non-Latin Scripts: Experiments on Kurdish (Sorani) Texts

Paper Title

Using Punkt for Sentence Segmentation in non-Latin Scripts: Experiments on Kurdish (Sorani) Texts

Authors

Abdulrahman, Roshna Omer; Hassani, Hossein

Abstract

Segmentation is a fundamental step for most Natural Language Processing tasks. Kurdish is a multi-dialect, under-resourced language written in several scripts. The lack of a variety of segmented corpora is one of the major bottlenecks in Kurdish language processing. We used Punkt, an unsupervised machine learning method, to segment a Kurdish corpus of the Sorani dialect written in Persian-Arabic script. According to the literature, studies on using Punkt on non-Latin data are scarce. In our experiment, we achieved an F1 score of 91.10% with an Error Rate of 16.32%. The high Error Rate is mainly due to abbreviations in Kurdish and partly due to ordinal numerals. The data is publicly available at https://github.com/KurdishBLARK/KTC-Segmented for non-commercial use under the CC BY-NC-SA 4.0 licence.
