Paper Title
Persian Emotion Detection using ParsBERT and Imbalanced Data Handling Approaches
Paper Authors
Paper Abstract
Emotion recognition is a machine learning application that can be carried out on text, speech, or image data gathered from social media. Detecting emotion is useful in several fields, including opinion mining. With the spread of social media, platforms such as Twitter have become data sources, but the language used on them is informal, which makes emotion detection difficult. EmoPars and ArmanEmo are two new human-labeled emotion datasets for the Persian language. These datasets, especially EmoPars, suffer from class imbalance: the number of samples differs across classes. In this paper, we evaluate EmoPars and compare it with ArmanEmo. Throughout this analysis, we use data augmentation, data re-sampling, and class weights with Transformer-based Pretrained Language Models (PLMs) to handle the imbalance in these datasets. Moreover, feature selection is used to enhance the models' performance by emphasizing specific features of the text. In addition, we propose a new policy for selecting data from EmoPars that keeps only high-confidence samples; as a result, the model does not see samples without a specific emotion during training. Our model reaches a macro-averaged F1-score of 0.81 on ArmanEmo and 0.76 on EmoPars, which are new state-of-the-art results on these benchmarks.
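As a rough illustration of two ideas named in the abstract, class weights and the macro-averaged F1-score, the sketch below uses plain Python. The abstract does not say how the weights are computed, so the inverse-frequency formula here (the same heuristic as scikit-learn's "balanced" mode) is an assumption, and the function names and toy labels are hypothetical, not taken from the paper or its datasets.

```python
from collections import Counter


def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count(class)).

    Assumed weighting scheme; rarer classes receive larger weights.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in counts}


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels, made up for illustration only.
labels = ["anger", "anger", "anger", "anger", "fear", "fear", "joy", "joy"]
weights = class_weights(labels)
```

Because macro averaging gives every class the same weight regardless of its size, it penalizes a model that ignores minority emotions, which is why it suits imbalanced benchmarks such as the ones described above.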