Development of a Dataset and a Deep Learning Baseline Named Entity Recognizer for Three Low Resource Languages: Bhojpuri, Maithili and Magahi

Paper title

Development of a Dataset and a Deep Learning Baseline Named Entity Recognizer for Three Low Resource Languages: Bhojpuri, Maithili and Magahi

Authors

Mundotiya, Rajesh Kumar; Kumar, Shantanu; Kumar, Ajeet; Chaudhary, Umesh Chandra; Chauhan, Supriya; Mishra, Swasti; Gatla, Praveen; Singh, Anil Kumar

Abstract

In Natural Language Processing (NLP) pipelines, Named Entity Recognition (NER) is one of the preliminary problems; it marks proper nouns and other named entities such as Location, Person, Organization, Disease, etc. Without an NER module, such entities adversely affect the performance of a machine translation system. NER helps overcome this problem by recognising and handling such entities separately, although it can also be useful in Information Extraction systems. Bhojpuri, Maithili and Magahi are low-resource languages, usually known as Purvanchal languages. This paper focuses on developing an NER benchmark dataset for the Machine Translation systems built to translate from these languages to Hindi, by annotating parts of their available corpora. Bhojpuri, Maithili and Magahi corpora of 228373, 157468 and 56190 tokens, respectively, were annotated using 22 entity labels. The annotation considers coarse-grained annotation labels followed by the tagset used in one of the Hindi NER datasets. We also report a Deep Learning-based baseline that uses an LSTM-CNNs-CRF model. The lower baseline F1-scores from the NER tool obtained by using Conditional Random Fields models are 96.73 for Bhojpuri, 93.33 for Maithili and 95.04 for Magahi. The Deep Learning-based technique (LSTM-CNNs-CRF) achieved 96.25 for Bhojpuri, 93.33 for Maithili and 95.44 for Magahi.
