Paper Title

ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection

Paper Authors

Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, Ece Kamar

Paper Abstract

Toxic language detection systems often falsely flag text that contains minority group mentions as toxic, as those groups are often the targets of online hate. Such over-reliance on spurious correlations also causes systems to struggle with detecting implicitly toxic language. To help mitigate these issues, we create ToxiGen, a new large-scale and machine-generated dataset of 274k toxic and benign statements about 13 minority groups. We develop a demonstration-based prompting framework and an adversarial classifier-in-the-loop decoding method to generate subtly toxic and benign text with a massive pretrained language model. Controlling machine generation in this way allows ToxiGen to cover implicitly toxic text at a larger scale, and about more demographic groups, than previous resources of human-written text. We conduct a human evaluation on a challenging subset of ToxiGen and find that annotators struggle to distinguish machine-generated text from human-written language. We also find that 94.5% of toxic examples are labeled as hate speech by human annotators. Using three publicly-available datasets, we show that finetuning a toxicity classifier on our data improves its performance on human-written data substantially. We also demonstrate that ToxiGen can be used to fight machine-generated toxicity as finetuning improves the classifier significantly on our evaluation subset. Our code and data can be found at https://github.com/microsoft/ToxiGen.
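
The abstract names two generation techniques: demonstration-based prompting and classifier-in-the-loop decoding. The sketch below is a minimal illustration of that overall idea, not the authors' released implementation: it assumes the Hugging Face transformers library, uses placeholder model names (gpt2 and unitary/toxic-bert), and approximates the paper's token-level classifier guidance with a much simpler sequence-level re-ranking of sampled candidates.

```python
# Minimal sketch of demonstration-based prompting with a classifier in the
# loop. NOT the ToxiGen release: the model names are placeholders, and the
# paper's token-level classifier-guided decoding is approximated here by
# sampling several continuations and re-ranking them with a toxicity score.
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in for the large pretrained LM
lm = AutoModelForCausalLM.from_pretrained("gpt2")
clf = pipeline("text-classification", model="unitary/toxic-bert")  # stand-in classifier

# Demonstration-based prompt: a few same-style statements about one
# demographic group; the LM is asked to continue the list.
demonstrations = [
    "- immigrants often revitalize the neighborhoods they move into",
    "- many immigrants run successful small businesses",
]
prompt = "\n".join(demonstrations) + "\n-"

inputs = tok(prompt, return_tensors="pt")
outputs = lm.generate(
    **inputs,
    do_sample=True,
    top_p=0.9,
    max_new_tokens=30,
    num_return_sequences=5,
    pad_token_id=tok.eos_token_id,
)
candidates = [
    tok.decode(seq[inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    for seq in outputs
]

def toxicity(text: str) -> float:
    # Label names vary by classifier; we assume a binary model where the
    # "toxic" label's score is the toxicity probability.
    result = clf(text)[0]
    score = result["score"]
    return score if result["label"].lower().startswith("tox") else 1.0 - score

# Steer toward the benign side by keeping the least toxic candidate;
# flipping min to max would steer toward the (implicitly) toxic side instead.
print(min(candidates, key=toxicity))
```

In the paper, the classifier instead shifts next-token probabilities during decoding; the re-ranking above is simply the cheapest way to show the same generator-versus-discriminator interplay that lets one corpus cover both subtly toxic and benign statements about each group.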
