Paper Title

Arabic Diacritic Recovery Using a Feature-Rich biLSTM Model

Authors

Kareem Darwish, Ahmed Abdelali, Hamdy Mubarak, Mohamed Eldesouki

Abstract

Diacritics (short vowels) are typically omitted when writing Arabic text, and readers have to reintroduce them to correctly pronounce words. There are two types of Arabic diacritics: the first are core-word diacritics (CW), which specify the lexical selection, and the second are case endings (CE), which typically appear at the end of the word stem and generally specify the word's syntactic role. Recovering CEs is relatively harder than recovering core-word diacritics due to inter-word dependencies, which are often distant. In this paper, we use a feature-rich recurrent neural network model that uses a variety of linguistic and surface-level features to recover both core-word diacritics and case endings. Our model surpasses all previous state-of-the-art systems with a CW error rate (CWER) of 2.86% and a CE error rate (CEER) of 3.7% for Modern Standard Arabic (MSA) and a CWER of 2.2% and CEER of 2.5% for Classical Arabic (CA). When combining diacritized word cores with case endings, the resultant word error rate is 6.0% and 4.3% for MSA and CA respectively. This highlights the effectiveness of feature engineering for such deep neural models.
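
As a rough illustration of how the three error rates in the abstract relate to one another, here is a minimal Python sketch. It assumes each rate is the fraction of words whose core-word diacritics, case ending, or either one differs from the reference; the abstract does not spell out the exact counting rules, so this is an assumption. The function name `error_rates` and the transliterated toy words are hypothetical and are not taken from the paper's data.

```python
from typing import List, Tuple

# Each word is modeled as (core-word diacritized form, case ending).
# This representation is an assumption made for illustration only.
Word = Tuple[str, str]


def error_rates(reference: List[Word], predicted: List[Word]) -> Tuple[float, float, float]:
    """Return (CWER, CEER, WER) as fractions of the total word count."""
    if not reference or len(reference) != len(predicted):
        raise ValueError("reference and predicted must be non-empty and equal in length")
    n = len(reference)
    pairs = list(zip(reference, predicted))
    # Core-word error: the diacritized stem differs from the reference.
    cw_errors = sum(ref[0] != pred[0] for ref, pred in pairs)
    # Case-ending error: the final case marker differs from the reference.
    ce_errors = sum(ref[1] != pred[1] for ref, pred in pairs)
    # Word error: the core, the case ending, or both are wrong.
    word_errors = sum(ref != pred for ref, pred in pairs)
    return cw_errors / n, ce_errors / n, word_errors / n


# Hypothetical toy data in Latin transliteration.
reference = [("kitAb", "u"), ("walad", "a"), ("bayt", "i"), ("qalam", "u")]
predicted = [("kitAb", "u"), ("walad", "u"), ("bAyt", "i"), ("qulam", "a")]

cwer, ceer, wer = error_rates(reference, predicted)
print(f"CWER={cwer:.2f} CEER={ceer:.2f} WER={wer:.2f}")
```

Under this counting, a word with both kinds of error is counted only once in the combined rate, so the word error rate can never exceed CWER + CEER. The reported figures are consistent with that bound: 6.0% is at most 2.86% + 3.7% for MSA, and 4.3% is at most 2.2% + 2.5% for CA.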
