Paper Title

TArC: Incrementally and Semi-Automatically Collecting a Tunisian Arabish Corpus

Authors

Gugliotta, Elisa; Dinarelli, Marco

Abstract

This article describes the construction of the first morpho-syntactically annotated Tunisian Arabish Corpus (TArC). Arabish, also known as Arabizi, is a spontaneous encoding of Arabic dialects in Latin characters and arithmographs (numbers used as letters). This code-system was developed by Arabic-speaking users of social media to facilitate writing in the informal settings of Computer-Mediated Communication (CMC) and text messaging. The realization of Arabish varies among dialects, and each Arabish code-system is under-resourced, as are most Arabic dialects. In the last few years, the focus on Arabic dialects in the NLP field has increased considerably. With this in mind, TArC will be a useful resource for different types of analysis, both computational and linguistic, as well as for training NLP tools. In this article we describe preliminary work on the semi-automatic construction of TArC and some of the first analyses we carried out on it. In addition, to give a complete overview of the challenges faced during the building process, we present the main characteristics of the Tunisian dialect and their encoding in Tunisian Arabish.
