Paper title
CatIss: An Intelligent Tool for Categorizing Issues Reports using Transformers
Paper authors
Paper abstract
Users use Issue Tracking Systems to keep track of and manage issue reports in their repositories. An issue is a rich source of software information and may report a problem, request a new feature, or merely ask a question about the software product. As the number of issues grows, managing them manually becomes harder, so automatic approaches have been proposed to facilitate the management of issue reports. This paper describes CatIss, an automatic CATegorizer of ISSue reports built upon the Transformer-based pre-trained RoBERTa model. CatIss classifies issue reports into three main categories: Bug reports, Enhancement/feature requests, and Questions. First, the datasets provided for the NLBSE tool competition are cleaned and preprocessed. Then, the pre-trained RoBERTa model is fine-tuned on the preprocessed dataset. An evaluation on about 80 thousand issue reports from GitHub indicates that CatIss surpasses the competition baseline, TicketTagger, achieving an 87.2% F1-score (micro average). Additionally, because CatIss is trained on a wide set of repositories, it is a generic prediction model and is therefore applicable to unseen software projects and to projects with little historical data. Scripts for cleaning the datasets, training CatIss, and evaluating the model are publicly available.
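To make the reported metric concrete, the following is a minimal sketch of how a micro-averaged F1-score can be computed for a three-class issue classifier. It is not the authors' evaluation script: the function name `micro_f1`, the label strings, and the toy predictions are assumptions made purely for illustration. Micro averaging pools true positives, false positives, and false negatives over all classes before computing F1.

```python
from collections import Counter

# Hypothetical label names for the three categories named in the abstract.
LABELS = ("bug", "enhancement", "question")


def micro_f1(y_true, y_pred, labels=LABELS):
    """Micro-averaged F1: pool TP/FP/FN over all classes, then compute F1."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative

    total_tp = sum(tp[label] for label in labels)
    total_fp = sum(fp[label] for label in labels)
    total_fn = sum(fn[label] for label in labels)

    denominator = 2 * total_tp + total_fp + total_fn
    return 2 * total_tp / denominator if denominator else 0.0


# Toy data, invented for illustration only.
y_true = ["bug", "bug", "enhancement", "question", "enhancement"]
y_pred = ["bug", "enhancement", "enhancement", "question", "bug"]
score = micro_f1(y_true, y_pred)
```

In a single-label setting such as this one, where every issue receives exactly one predicted category, each misclassification counts once as a false positive and once as a false negative, so the micro-averaged F1 coincides with plain accuracy.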