Title
Nonparametric Feature Impact and Importance
Authors
Abstract
Practitioners use feature importance to rank and eliminate weak predictors during model development in an effort to simplify models and improve generality. Unfortunately, they also routinely conflate such feature importance measures with feature impact, the isolated effect of an explanatory variable on the response variable. This can lead to real-world consequences when importance is inappropriately interpreted as impact for business or medical insight purposes. The dominant approach for computing importances is through interrogation of a fitted model, which works well for feature selection, but gives distorted measures of feature impact. The same method applied to the same data set can yield different feature importances, depending on the model, leading us to conclude that impact should be computed directly from the data. While there are nonparametric feature selection algorithms, they typically provide feature rankings, rather than measures of impact or importance. They also typically focus on single-variable associations with the response. In this paper, we give mathematical definitions of feature impact and importance, derived from partial dependence curves, that operate directly on the data. To assess quality, we show that features ranked by these definitions are competitive with existing feature selection techniques using three real data sets for predictive tasks.
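To make the core idea concrete, the sketch below is an illustrative toy, not the paper's actual algorithm: it approximates a one-dimensional partial-dependence-style curve directly from the data (no fitted model) by binning a feature and averaging the response within each bin, then collapses that curve into a single impact score as the mean absolute deviation of the curve from its own mean. All function names (`data_pd_curve`, `impact`) and the binning scheme are assumptions chosen for the example.

```python
# Hedged sketch: a model-free, binned approximation of a partial-dependence
# curve, plus a crude scalar "impact" summary. Not the paper's method.
import numpy as np

def data_pd_curve(x, y, nbins=10):
    """Bin x into equal-width bins; return bin centers and mean y per bin."""
    edges = np.linspace(x.min(), x.max(), nbins + 1)
    # digitize returns 1..nbins+1; shift/clip so every point lands in a bin
    idx = np.clip(np.digitize(x, edges) - 1, 0, nbins - 1)
    centers = (edges[:-1] + edges[1:]) / 2
    means = np.array([y[idx == b].mean() if np.any(idx == b) else np.nan
                      for b in range(nbins)])
    return centers, means

def impact(x, y, nbins=10):
    """Mean absolute deviation of the data-derived curve from its mean:
    a flat curve (feature has no isolated effect) scores near zero."""
    _, means = data_pd_curve(x, y, nbins)
    means = means[~np.isnan(means)]
    return float(np.mean(np.abs(means - means.mean())))

# Synthetic demo: x1 drives y strongly, x2 not at all.
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 1, 1000)
x2 = rng.uniform(0, 1, 1000)
y = 3 * x1 + rng.normal(0, 0.1, 1000)
```

On data like this, `impact(x1, y)` comes out much larger than `impact(x2, y)`, mirroring the abstract's point that impact can be estimated from the data itself rather than by interrogating a fitted model. A real implementation must also contend with correlated features, which simple one-variable binning ignores.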