NLPDove at SemEval-2020 Task 12: Improving Offensive Language Detection with Cross-lingual Transfer

Paper Title

NLPDove at SemEval-2020 Task 12: Improving Offensive Language Detection with Cross-lingual Transfer

Authors

Hwijeen Ahn, Jimin Sun, Chan Young Park, Jungyun Seo

Abstract

This paper describes our approach to identifying offensive language in a multilingual setting. We investigate two data augmentation strategies: using additional semi-supervised labels with different thresholds, and cross-lingual transfer with data selection. Leveraging the semi-supervised dataset improved performance over the baseline trained solely on the manually annotated dataset. We propose a new metric, Translation Embedding Distance, to measure the transferability of instances for cross-lingual data selection. We also introduce various preprocessing steps tailored to social media text, along with methods for fine-tuning the pre-trained multilingual BERT (mBERT) for offensive language identification. Our multilingual systems achieved competitive results in Greek, Danish, and Turkish at OffensEval 2020.
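
The first augmentation strategy, selecting semi-supervised labels by threshold, can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `select_by_threshold`, the default cut-offs `lower` and `upper`, the `OFF`/`NOT` label strings, and the assumption that each instance carries a single confidence score in [0, 1] are all hypothetical choices made for illustration.

```python
from typing import Iterable, List, Tuple


def select_by_threshold(
    scored: Iterable[Tuple[str, float]],
    lower: float = 0.3,  # hypothetical cut-off, not taken from the paper
    upper: float = 0.7,  # hypothetical cut-off, not taken from the paper
) -> List[Tuple[str, str]]:
    """Turn semi-supervised confidence scores into hard labels.

    Each input item is (text, score), where score is assumed to be the
    confidence in [0, 1] that the text is offensive. Instances scoring at
    or above `upper` are labeled "OFF", those at or below `lower` are
    labeled "NOT", and everything in between is discarded as too uncertain.
    """
    if not 0.0 <= lower <= upper <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= lower <= upper <= 1")

    selected: List[Tuple[str, str]] = []
    for text, score in scored:
        if score >= upper:
            selected.append((text, "OFF"))
        elif score <= lower:
            selected.append((text, "NOT"))
        # Scores strictly between the thresholds are dropped.
    return selected
```

Widening the gap between `lower` and `upper` yields fewer but more reliable additional training instances, while narrowing it adds more data at the cost of noisier labels. Varying the thresholds in this way is one means of exploring that trade-off.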
