Paper Title
OSACT4 Shared Task on Offensive Language Detection: Intensive Preprocessing-Based Approach
Authors
Abstract
The preprocessing phase is one of the key phases of the text classification pipeline. This study investigates the impact of preprocessing on text classification, specifically on offensive language and hate speech classification for Arabic text. The Arabic used on social media is informal and written in Arabic dialects, which makes text classification considerably harder. Preprocessing helps reduce dimensionality and remove useless content. We apply intensive preprocessing techniques to the dataset before processing it further and feeding it into the classification model. The intensive preprocessing-based approach shows a significant impact on the offensive language detection and hate speech detection shared tasks of the fourth workshop on Open-Source Arabic Corpora and Corpora Processing Tools (OSACT). Our team placed third in Sub-Task A (Offensive Language Detection) and first in Sub-Task B (Hate Speech Detection), with F1 scores of 89% and 95%, respectively, achieving state-of-the-art performance in terms of F1, accuracy, recall, and precision for Arabic hate speech detection.
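
The abstract does not list the individual preprocessing steps, so the following is only an illustrative sketch of cleaning steps commonly applied to Arabic social media text (removing URLs and mentions, stripping diacritics and tatweel, unifying letter variants, collapsing elongated characters). The function name `preprocess` and the choice and order of steps are assumptions for illustration, not the authors' confirmed pipeline.

```python
import re

# Assumed, commonly used normalization steps for Arabic tweets.
# These are NOT taken from the paper; the abstract does not specify them.
URL_OR_MENTION = re.compile(r"https?://\S+|@\w+")
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")   # tanween, harakat, shadda, sukun, dagger alef
TATWEEL = "\u0640"                                   # elongation mark used for stretching words
ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")  # alef with madda / hamza above / hamza below
ELONGATION = re.compile(r"(.)\1{2,}")                # same character repeated 3+ times
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")       # anything outside the basic Arabic letters


def preprocess(text: str) -> str:
    """Return a normalized copy of `text` (hypothetical helper)."""
    text = URL_OR_MENTION.sub(" ", text)        # drop links and user mentions
    text = DIACRITICS.sub("", text)             # strip short-vowel marks before other filtering
    text = text.replace(TATWEEL, "")            # remove tatweel
    text = ALEF_VARIANTS.sub("\u0627", text)    # unify alef forms to bare alef
    text = text.replace("\u0649", "\u064A")     # alef maqsura -> ya
    text = text.replace("\u0629", "\u0647")     # ta marbuta -> ha
    text = ELONGATION.sub(r"\1", text)          # collapse runs of 3+ identical characters
    text = NON_ARABIC.sub(" ", text)            # drop digits, Latin letters, punctuation, emoji
    return " ".join(text.split())               # normalize whitespace


if __name__ == "__main__":
    print(preprocess("@user مرحبااا بِكُم http://t.co/x 123!"))
```

Steps like these shrink the vocabulary by mapping spelling variants of the same word to one form, which is one way preprocessing can reduce dimensionality before the text reaches a classifier.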