New Results for the Text Recognition of Arabic Maghribī Manuscripts -- Managing an Under-resourced Script

Paper Title

New Results for the Text Recognition of Arabic Maghribī Manuscripts -- Managing an Under-resourced Script

Authors

Noëmie Lucas, Clément Salah, Chahan Vidal-Gorène

Abstract

The development of handwritten text recognition (HTR) models has become a conventional step in digital humanities projects. The performance of these models, often quite high, relies on manual transcription and on large numbers of handwritten documents. Although the method has proven successful for Latin scripts, a similar amount of data is not yet achievable for scripts considered poorly endowed, such as Arabic scripts. In that respect, we introduce and assess a new modus operandi for HTR model development and fine-tuning dedicated to the Arabic Maghribī scripts. A comparison of several state-of-the-art HTR systems demonstrates the relevance of a word-based neural approach specialized for Arabic, capable of achieving an error rate below 5% with only 10 manually transcribed pages. These results open new perspectives for the processing of Arabic scripts and, more generally, of poorly endowed languages. This research is part of the development of the RASAM dataset, in partnership with the GIS MOMM and the BULAC.
