Paper title
Curras + Baladi: Towards a Levantine Corpus
Paper authors
Paper abstract
The processing of the Arabic language is a complex field of research. This is due to many factors, including Arabic's complex and rich morphology, its high degree of ambiguity, and the presence of several regional varieties, each of which must be processed with its unique characteristics in mind. When its dialects are taken into account, the language pushes the limits of NLP to find solutions to the problems posed by its inherent nature. Arabic is a diglossic language: the standard language is used in formal settings and in education, and it is quite different from the vernaculars spoken in the different regions, which are influenced by older languages historically spoken in those regions. This should encourage NLP specialists to create dialect-specific corpora such as Curras, the morphologically annotated Palestinian corpus of Birzeit University. In this work, we present Baladi, a Lebanese corpus that consists of around 9.6K morphologically annotated tokens. Since the Lebanese and Palestinian dialects are part of the same Levantine dialectal continuum, and thus highly mutually intelligible, our proposed corpus was constructed to (1) enrich Curras and transform it into a more general Levantine corpus and (2) improve Curras by correcting detected errors.