Deep Learning Models for Multilingual Hate Speech Detection

Paper Title

Deep Learning Models for Multilingual Hate Speech Detection

Authors

Sai Saketh Aluru, Binny Mathew, Punyajoy Saha, Animesh Mukherjee

Abstract

Hate speech detection is a challenging problem, and most available datasets are in only one language: English. In this paper, we conduct a large-scale analysis of multilingual hate speech in 9 languages from 16 different sources. We observe that in low-resource settings, simple models such as LASER embeddings with logistic regression perform best, while in high-resource settings BERT-based models perform better. In zero-shot classification, languages such as Italian and Portuguese achieve good results. Our proposed framework could be used as an efficient solution for low-resource languages. These models could also act as good baselines for future multilingual hate speech detection tasks. We have made our code and experimental settings public for other researchers at https://github.com/punyajoy/DE-LIMIT.
