Paper Title
BanFakeNews: A Dataset for Detecting Fake News in Bangla
Authors
Abstract
Given the damage that the rapid propagation of fake news can cause in sectors such as politics and finance, automatic identification of fake news using linguistic analysis has drawn the attention of the research community. However, such methods are largely being developed for English, while low-resource languages remain out of focus. Yet the risks posed by fake and manipulative news are not confined to any one language. In this work, we propose an annotated dataset of ~50K news articles that can be used to build automated fake news detection systems for a low-resource language like Bangla. Additionally, we provide an analysis of the dataset and develop a benchmark system with state-of-the-art NLP techniques to identify Bangla fake news. To create this system, we explore traditional linguistic features and neural-network-based methods. We expect this dataset to be a valuable resource for building technologies that prevent the spread of fake news and to contribute to research on low-resource languages.