A ground-truth dataset and classification model for detecting bots in GitHub issue and PR comments

Paper title

A ground-truth dataset and classification model for detecting bots in GitHub issue and PR comments

Authors

Mehdi Golzadeh, Alexandre Decan, Damien Legay, Tom Mens

Abstract

Bots are frequently used in GitHub repositories to automate repetitive activities that are part of the distributed software development process. They communicate with human actors through comments. While detecting their presence is important for many reasons, no large and representative ground-truth dataset is available, nor are there classification models to detect and validate bots on the basis of such a dataset. This paper proposes a ground-truth dataset, based on a manual analysis with high interrater agreement, of pull request and issue comments in 5,000 distinct GitHub accounts, of which 527 have been identified as bots. Using this dataset, we propose an automated classification model to detect bots, taking as main features the number of empty and non-empty comments of each account, the number of comment patterns, and the inequality between comments within comment patterns. We obtained a very high weighted average precision, recall and F1-score of 0.98 on a test set containing 40% of the data. We integrated the classification model into an open source command-line tool to allow practitioners to detect which accounts in a given GitHub repository actually correspond to bots.
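To make the four features named in the abstract concrete, the following is a minimal Python sketch of how they might be computed for one account. It is not the authors' implementation. The abstract does not say how comment patterns are formed or how inequality is measured, so two assumptions are made here: a "pattern" is a group of comments that become identical after lowercasing, masking digits and collapsing whitespace (the hypothetical helper `normalize`), and inequality is taken as the Gini coefficient of the pattern sizes. The function and key names are illustrative only.

```python
import re
from collections import Counter


def gini(values):
    """Gini coefficient of non-negative values (0.0 means perfectly equal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Formula on ascending data: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def normalize(comment):
    """Hypothetical pattern key (an assumption, not taken from the paper):
    lowercase the text, mask digit runs, and collapse whitespace."""
    masked = re.sub(r"\d+", "<n>", comment.lower())
    return " ".join(masked.split())


def account_features(comments):
    """Compute the four per-account features listed in the abstract."""
    empty = sum(1 for c in comments if not c.strip())
    non_empty = [c for c in comments if c.strip()]
    # Group non-empty comments by their normalized form.
    patterns = Counter(normalize(c) for c in non_empty)
    return {
        "empty_comments": empty,
        "non_empty_comments": len(non_empty),
        "comment_patterns": len(patterns),
        # Assumed reading of "inequality": Gini over pattern sizes.
        "pattern_inequality": gini(patterns.values()),
    }


features = account_features(
    ["Build 12 passed", "Build 13 passed", "build  14 passed", "", "Thanks!"]
)
```

Under these assumptions, an account that posts many near-identical comments yields few patterns relative to its comment count, which is the kind of signal the abstract suggests separates bots from humans. A classifier would then be trained on one such feature row per account.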

Paper title