Paper title
Stanceosaurus: Classifying Stance Towards Multilingual Misinformation
Paper authors
Abstract
We present Stanceosaurus, a new corpus of 28,033 tweets in English, Hindi, and Arabic annotated with stance towards 251 misinformation claims. As far as we are aware, it is the largest corpus annotated with stance towards misinformation claims. The claims in Stanceosaurus originate from 15 fact-checking sources that cover diverse geographical regions and cultures. Unlike existing stance datasets, we introduce a more fine-grained 5-class labeling strategy with additional subcategories to distinguish implicit stance. Pre-trained transformer-based stance classifiers that are fine-tuned on our corpus show good generalization on unseen claims and regional claims from countries outside the training data. Cross-lingual experiments demonstrate Stanceosaurus' capability of training multi-lingual models, achieving 53.1 F1 on Hindi and 50.4 F1 on Arabic without any target-language fine-tuning. Finally, we show how a domain adaptation method can be used to improve performance on Stanceosaurus using additional RumourEval-2019 data. We make Stanceosaurus publicly available to the research community and hope it will encourage further work on misinformation identification across languages and cultures.