Paper Title
Cracking Double-Blind Review: Authorship Attribution with Deep Learning
Paper Authors
Abstract
Double-blind peer review is considered a pillar of academic research because it is perceived to ensure a fair, unbiased, and fact-centered scientific discussion. Yet experienced researchers can often correctly guess which research group an anonymous submission originates from, which biases the peer-review process. In this work, we present a transformer-based neural network architecture that uses only the text content and the author names in the bibliography to attribute an anonymous manuscript to an author. To train and evaluate our method, we created the largest authorship identification dataset to date. It leverages all research papers publicly available on arXiv, amounting to over 2 million manuscripts. In arXiv subsets with up to 2,000 different authors, our method achieves an unprecedented authorship attribution accuracy, with up to 73% of papers attributed correctly. We present a scaling analysis to highlight the applicability of the proposed method to even larger datasets once sufficient compute capabilities are more widely available to the academic community. Furthermore, we analyze the attribution accuracy in settings where the goal is to identify all authors of an anonymous manuscript. Thanks to our method, we are not only able to predict the author of an anonymous work, but we also provide empirical evidence of the key aspects that make a paper attributable. We have open-sourced the necessary tools to reproduce our experiments.
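To make the task setup concrete, the following is a minimal sketch of authorship attribution as ranking candidate authors and scoring top-k accuracy. It is not the transformer model described in the abstract: it swaps in a toy baseline that scores candidates only by overlap in cited author names, one of the two input signals the abstract mentions. All function names, author labels, and data below are hypothetical and chosen for illustration.

```python
from collections import Counter


def cited_author_profile(papers):
    """Count how often each cited name appears across one author's known papers."""
    profile = Counter()
    for cited_names in papers:
        profile.update(cited_names)
    return profile


def rank_candidates(cited_names, profiles):
    """Rank candidate authors by how strongly their profile overlaps the manuscript's citations."""
    scores = {
        author: sum(profile[name] for name in cited_names)
        for author, profile in profiles.items()
    }
    # Highest score first; ties are broken alphabetically so the ranking is deterministic.
    return sorted(scores, key=lambda author: (-scores[author], author))


def top_k_accuracy(manuscripts, profiles, k=1):
    """Fraction of manuscripts whose true author appears among the top-k ranked candidates."""
    hits = sum(
        true_author in rank_candidates(cited_names, profiles)[:k]
        for true_author, cited_names in manuscripts
    )
    return hits / len(manuscripts)


# Hypothetical toy data: cited author names from each candidate's known papers.
training = {
    "author_a": [["Smith", "Lee", "Smith"], ["Smith", "Kumar"]],
    "author_b": [["Garcia", "Lee"], ["Garcia", "Chen"]],
}
profiles = {author: cited_author_profile(papers) for author, papers in training.items()}

# Hypothetical anonymous manuscripts as (true author, cited names) pairs.
anonymous = [
    ("author_a", ["Smith", "Chen"]),
    ("author_b", ["Garcia", "Kumar"]),
    ("author_b", ["Lee"]),
]

print(top_k_accuracy(anonymous, profiles, k=1))
print(top_k_accuracy(anonymous, profiles, k=2))
```

A learned model would replace `rank_candidates` with a classifier over the full text and the bibliography, while the evaluation loop in `top_k_accuracy` would stay essentially the same.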