Paper title
RENT -- Repeated Elastic Net Technique for Feature Selection
Paper authors
Paper abstract
Feature selection is an essential step in data science pipelines to reduce the complexity associated with large datasets. While much research on this topic focuses on optimizing predictive performance, few studies investigate stability in the context of the feature selection process. In this study, we present the Repeated Elastic Net Technique (RENT) for feature selection. RENT uses an ensemble of generalized linear models with elastic net regularization, each trained on a distinct subset of the training data. The feature selection is based on three criteria that evaluate the weight distributions of features across all elementary models. This leads to the selection of features with high stability that improve the robustness of the final model. Furthermore, unlike established feature selectors, RENT provides valuable information for model interpretation concerning the identification of objects in the data that are difficult to predict during training. In our experiments, we benchmark RENT against six established feature selectors on eight multivariate datasets for binary classification and regression. In the experimental comparison, RENT shows a well-balanced trade-off between predictive performance and stability. Finally, we underline the additional interpretational value of RENT with an exploratory post-hoc analysis of a healthcare dataset.
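The ensemble idea described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation. The abstract does not spell out the three criteria, so the sketch assumes they are (1) the fraction of models in which a feature's weight is non-zero, (2) the consistency of the weight's sign across models, and (3) a t-type statistic testing whether the mean weight differs from zero. The function names, the thresholds `tau1`, `tau2`, and `t_crit`, the coordinate-descent solver, and the toy data are all hypothetical choices made for illustration; the sketch covers only the regression case with centred features and no intercept.

```python
import random
import statistics


def soft_threshold(z, g):
    """Soft-thresholding operator used by the L1 part of the elastic net."""
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0


def fit_elastic_net(X, y, alpha=0.1, l1_ratio=0.9, n_iter=50):
    """Coordinate descent for a linear model with elastic net penalty.

    Minimises (1/2n)*||y - Xw||^2 + alpha*l1_ratio*||w||_1
              + 0.5*alpha*(1 - l1_ratio)*||w||_2^2.
    Assumption: features are roughly centred, so no intercept is fitted.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    resid = list(y)  # residual y - Xw, kept up to date after each update
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * w[j]
            new = soft_threshold(rho, alpha * l1_ratio)
            new /= col_sq[j] + alpha * (1.0 - l1_ratio)
            delta = new - w[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                w[j] = new
    return w


def rent_select(X, y, n_models=25, subset_frac=0.8,
                tau1=0.9, tau2=0.9, t_crit=2.0, seed=0):
    """Select features whose weights are stable across an ensemble.

    Each elementary model is trained on a random subset of the rows.
    The three cut-offs (tau1, tau2, t_crit) are assumed, illustrative values.
    """
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    k = int(subset_frac * n)
    weights = []
    for _ in range(n_models):
        idx = rng.sample(range(n), k)
        weights.append(fit_elastic_net([X[i] for i in idx], [y[i] for i in idx]))

    selected = []
    for j in range(p):
        col = [w[j] for w in weights]
        # Criterion 1 (assumed): how often is the weight non-zero?
        nonzero_frac = sum(v != 0.0 for v in col) / n_models
        # Criterion 2 (assumed): do the weights agree in sign?
        sign_stability = abs(sum((v > 0) - (v < 0) for v in col)) / n_models
        # Criterion 3 (assumed): is the mean weight far from zero
        # relative to its standard error?
        mean = statistics.fmean(col)
        sd = statistics.stdev(col)
        if sd > 0.0:
            t_stat = abs(mean) / (sd / n_models ** 0.5)
        else:
            t_stat = float("inf") if mean != 0.0 else 0.0
        if nonzero_frac >= tau1 and sign_stability >= tau2 and t_stat >= t_crit:
            selected.append(j)
    return selected


# Toy data (hypothetical): only features 0 and 1 carry signal.
_rng = random.Random(0)
X = [[_rng.gauss(0, 1) for _ in range(5)] for _ in range(200)]
y = [3 * row[0] - 2 * row[1] + 0.1 * _rng.gauss(0, 1) for row in X]

selected = rent_select(X, y)
print("selected feature indices:", selected)
```

In this sketch a feature is kept only if it passes all three cut-offs at once, so a feature that receives a non-zero weight in just a few of the subsets is discarded even if those weights are large. That is one way to read the abstract's emphasis on stability; the actual criteria and thresholds should be taken from the paper itself.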