Paper title
NLPDove at SemEval-2020 Task 12: Improving Offensive Language Detection with Cross-lingual Transfer
Paper authors
Paper abstract
This paper describes our approach to identifying offensive language in a multilingual setting. We investigate two data augmentation strategies: using additional semi-supervised labels with different thresholds, and cross-lingual transfer with data selection. Leveraging the semi-supervised dataset improved performance over the baseline trained solely on the manually annotated dataset. We propose a new metric, Translation Embedding Distance, to measure the transferability of instances for cross-lingual data selection. We also introduce various preprocessing steps tailored to social media text, along with methods to fine-tune the pre-trained multilingual BERT (mBERT) for offensive language identification. Our multilingual systems achieved competitive results in Greek, Danish, and Turkish at OffensEval 2020.
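The abstract names the two augmentation strategies without defining them, so the following is only a minimal sketch of how they might look, not the authors' implementation. It assumes that each semi-supervised example carries a score in [0, 1] estimating how likely it is to be offensive, that the threshold is at least 0.5, that Translation Embedding Distance can be approximated by the Euclidean distance between the embedding of an instance and the embedding of its translation, and that a smaller distance indicates a more transferable instance. The function names, the `OFF`/`NOT` label strings, and the data layout are all hypothetical.

```python
from math import sqrt


def filter_by_confidence(examples, threshold):
    """Keep only semi-supervised examples with a confident score.

    examples: list of (text, score) pairs, where score is an assumed
    probability in [0, 1] that the text is offensive.
    threshold: assumed to be >= 0.5; examples scoring between
    (1 - threshold) and threshold are dropped as uncertain.
    """
    kept = []
    for text, score in examples:
        if score >= threshold:
            kept.append((text, "OFF"))   # confidently offensive
        elif score <= 1.0 - threshold:
            kept.append((text, "NOT"))   # confidently not offensive
    return kept


def embedding_distance(source_vec, translated_vec):
    """Euclidean distance between two equal-length vectors.

    Assumption: stands in for Translation Embedding Distance, whose
    exact definition is not given in the abstract.
    """
    return sqrt(sum((a - b) ** 2 for a, b in zip(source_vec, translated_vec)))


def select_transferable(pairs, k):
    """Return the k texts whose translations stay closest in embedding space.

    pairs: list of (text, source_vec, translated_vec) triples; the
    vectors are assumed to come from a multilingual encoder such as mBERT.
    """
    ranked = sorted(pairs, key=lambda p: embedding_distance(p[1], p[2]))
    return [text for text, _, _ in ranked[:k]]
```

Raising the threshold in `filter_by_confidence` trades training-set size for label reliability, which is presumably why several thresholds are compared; `select_transferable` applies the same idea to cross-lingual data by keeping only the instances ranked as most transferable.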