Title
LSCP: Enhanced Large Scale Colloquial Persian Language Understanding
Authors
Abstract
Language recognition has advanced significantly in recent years thanks to modern machine learning methods such as deep learning and to benchmarks with rich annotations. However, research remains limited for low-resource formal languages, and there is a significant gap in describing colloquial language, especially for low-resource languages such as Persian. To address this gap, we propose the "Large Scale Colloquial Persian Dataset" (LSCP). LSCP is hierarchically organized in a semantic taxonomy that treats multi-task informal Persian language understanding as a comprehensive problem. This encompasses the recognition of multiple semantic aspects in human-level sentences that are captured naturally from real-world sentences. We believe that further investigation and processing, together with the application of novel algorithms and methods, can strengthen the computerized understanding and processing of low-resource languages. The proposed corpus consists of 120M sentences derived from 27M tweets, annotated with parse trees, part-of-speech tags, sentiment polarity, and translations into five different languages.