Fake News Data Collection and Classification: Iterative Query Selection for Opaque Search Engines with Pseudo Relevance Feedback

Paper Title

Fake News Data Collection and Classification: Iterative Query Selection for Opaque Search Engines with Pseudo Relevance Feedback

Authors

Aviad Elyashar, Maor Reuben, Rami Puzis

Abstract

Retrieving information from an online search engine is the first and most important step in many data mining tasks. Most of the search engines currently available on the web, including all social media platforms, are black boxes (a.k.a. opaque) supporting short keyword queries. In these settings, retrieving all posts and comments discussing a particular news item automatically and at large scale is a challenging task. In this paper, we propose a method for generating short keyword queries given a prototype document. The proposed iterative query selection algorithm (IQS) interacts with the opaque search engine to iteratively improve the query. It is evaluated on the Twitter TREC Microblog 2012 and TREC-COVID 2019 datasets, showing superior performance compared to the state of the art. IQS is applied to automatically collect a large-scale fake news dataset of about 70K true and fake news items. The dataset, publicly available for research, includes more than 22M accounts and 61M tweets in Twitter-approved format. We demonstrate the usefulness of the dataset for the fake news detection task, achieving state-of-the-art performance.
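The general idea of iteratively improving a short keyword query against an opaque engine can be illustrated with a minimal sketch. It is not the authors' IQS implementation: the abstract does not specify the search strategy or the scoring function, so the hill-climbing loop, the bag-of-words cosine score, and every name below (`toy_search`, `score_query`, `iterative_query_selection`) are assumptions made purely for illustration. The engine is treated as a black box that takes keywords and returns documents, and the returned documents are assumed relevant (pseudo relevance feedback) and compared against the prototype document.

```python
import random
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words token lists."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[w] * b[w] for w in a)
    norm = (sum(v * v for v in a.values()) * sum(v * v for v in b.values())) ** 0.5
    return dot / norm if norm else 0.0


def toy_search(corpus, query, k=3):
    """Hypothetical stand-in for an opaque engine: up to k docs containing every query word."""
    return [d for d in corpus if all(w in tokenize(d) for w in query)][:k]


def score_query(query, prototype_tokens, search):
    """Mean similarity of the returned documents to the prototype (0.0 if nothing is returned)."""
    results = search(query)
    if not results:
        return 0.0
    return sum(cosine(prototype_tokens, tokenize(r)) for r in results) / len(results)


def iterative_query_selection(prototype, search, query_len=2, iterations=50, seed=0):
    """Hill climbing (an assumed strategy): swap one keyword at a time, keep strict improvements."""
    rng = random.Random(seed)
    prototype_tokens = tokenize(prototype)
    vocab = sorted(set(prototype_tokens))
    query = rng.sample(vocab, query_len)
    best = score_query(query, prototype_tokens, search)
    for _ in range(iterations):
        candidate = list(query)
        candidate[rng.randrange(query_len)] = rng.choice(vocab)
        if len(set(candidate)) < query_len:
            continue  # skip candidates with a repeated keyword
        s = score_query(candidate, prototype_tokens, search)
        if s > best:
            query, best = candidate, s
    return query, best


# Toy data, invented for this example only.
corpus = [
    "vaccine rumor spreads on social media",
    "new vaccine rumor debunked by health officials",
    "football team wins the national cup",
    "officials confirm the cup schedule",
]
prototype = "health officials debunked a vaccine rumor"
query, best = iterative_query_selection(prototype, lambda q: toy_search(corpus, q))
```

Because the engine is only reached through the `search` callable, the same loop could wrap a real keyword search API, at the cost of one request per candidate query.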
