Paper Title
The Pushshift Reddit Dataset
Paper Authors
Paper Abstract
Social media data has become crucial to the advancement of scientific understanding. However, even though it has become ubiquitous, just collecting large-scale social media data requires a high degree of engineering skill and substantial computational resources. In fact, research is often gated by data engineering problems that must be overcome before analysis can proceed. This has resulted in the recognition of datasets as meaningful research contributions in and of themselves. Reddit, the so-called "front page of the Internet," in particular has been the subject of numerous scientific studies. Although Reddit is relatively open to data acquisition compared to social media platforms like Facebook and Twitter, technical barriers to acquisition remain. Thus, Reddit's millions of subreddits, hundreds of millions of users, and hundreds of billions of comments are relatively accessible, yet time-consuming to collect and analyze systematically. In this paper, we present the Pushshift Reddit dataset. Pushshift is a social media data collection, analysis, and archiving platform that has collected Reddit data since 2015 and made it available to researchers. Pushshift's Reddit dataset is updated in real time and includes historical data going back to Reddit's inception. In addition to monthly dumps, Pushshift provides computational tools to aid in searching, aggregating, and performing exploratory analysis on the entire dataset. The Pushshift Reddit dataset makes it possible for social media researchers to reduce the time spent in the data collection, cleaning, and storage phases of their projects.