Paper Title

Impact of Discretization Noise of the Dependent Variable on Machine Learning Classifiers in Software Engineering

Authors

Gopi Krishnan Rajbahadur, Shaowei Wang, Yasutaka Kamei, Ahmed E. Hassan

Abstract

Researchers usually discretize a continuous dependent variable into two target classes by introducing an artificial discretization threshold (e.g., the median). However, such discretization may introduce noise (i.e., discretization noise) because data points close to the artificial threshold have ambiguous class loyalty. Previous studies do not provide a clear directive on the impact of discretization noise on classifiers or on how to handle such noise. In this paper, we propose a framework to help researchers and practitioners systematically estimate the impact of discretization noise on classifiers, in terms of both various performance measures and the interpretation of the classifiers. Through a case study of 7 software engineering datasets, we find that: 1) discretization noise affects the different performance measures of a classifier differently for different datasets; 2) though the interpretation of the classifiers is impacted by discretization noise on the whole, the top 3 most important features are not affected by it. Therefore, we suggest that practitioners and researchers use our framework to understand the impact of discretization noise on the performance of the classifiers they build, and to estimate the exact amount of discretization noise to discard from the dataset to avoid the negative impact of such noise.
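
The idea of a median threshold and of ambiguous points near it can be illustrated with a minimal Python sketch. This is not the framework proposed in the paper, whose details the abstract does not give; it only shows median-based discretization and one possible way to flag points close to the threshold. The function names and the `band_fraction` parameter are hypothetical, chosen here purely for illustration.

```python
from statistics import median


def discretize_with_band(values, band_fraction=0.1):
    """Label values by a median threshold and flag those close to it.

    band_fraction is a hypothetical parameter (an assumption of this sketch):
    the fraction of the data range, on each side of the threshold, that is
    treated as the ambiguous zone.
    """
    threshold = median(values)
    half_width = band_fraction * (max(values) - min(values))
    # Class 1 if strictly above the threshold, class 0 otherwise.
    labels = [1 if v > threshold else 0 for v in values]
    # A point is flagged as ambiguous when it lies within the band.
    ambiguous = [abs(v - threshold) <= half_width for v in values]
    return threshold, labels, ambiguous


def drop_ambiguous(values, labels, ambiguous):
    """Keep only the (value, label) pairs that lie outside the band."""
    return [(v, y) for v, y, a in zip(values, labels, ambiguous) if not a]


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    threshold, labels, ambiguous = discretize_with_band(data, band_fraction=0.1)
    print(threshold, drop_ambiguous(data, labels, ambiguous))
```

Widening `band_fraction` discards more points near the threshold. How much to discard is exactly the question the abstract says the framework is meant to help answer, so the value used above carries no recommendation.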
