Conditional Feature Importance for Mixed Data

Paper title

Conditional Feature Importance for Mixed Data

Authors

Kristin Blesch, David S. Watson, Marvin N. Wright

Abstract

Despite the popularity of feature importance (FI) measures in interpretable machine learning, the statistical adequacy of these methods is rarely discussed. From a statistical perspective, a major distinction is between analyzing a variable's importance before and after adjusting for covariates, i.e., between $\textit{marginal}$ and $\textit{conditional}$ measures. Our work draws attention to this rarely acknowledged yet crucial distinction and showcases its implications. Further, we reveal that only a few methods are available for testing conditional FI, and that practitioners have hitherto been severely restricted in applying them because of mismatching data requirements. Most real-world data exhibits complex feature dependencies and incorporates both continuous and categorical data (mixed data). Both properties are often neglected by conditional FI measures. To fill this gap, we propose to combine the conditional predictive impact (CPI) framework with sequential knockoff sampling. The CPI enables conditional FI measurement that controls for any feature dependencies by sampling valid knockoffs (that is, synthetic data with similar statistical properties) for the data to be analyzed. Sequential knockoffs were deliberately designed to handle mixed data and thus allow us to extend the CPI approach to such datasets. We demonstrate through numerous simulations and a real-world example that our proposed workflow controls type I error, achieves high power, and is in line with results given by other conditional FI measures, whereas marginal FI metrics result in misleading interpretations. Our findings highlight the necessity of developing statistically adequate, specialized methods for mixed data.
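
The marginal versus conditional distinction described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' workflow: it assumes purely continuous data, uses ordinary least squares as the predictive model, and swaps in conditional resampling from a fitted Gaussian linear model as a stand-in for sequential knockoff sampling. The function name `cpi_style_test` and all variable names are hypothetical. The idea is only to show how a feature that is correlated with the outcome marginally can contribute nothing once the other features are accounted for.

```python
import numpy as np


def cpi_style_test(X_train, y_train, X_test, y_test, j, rng):
    """Return (mean loss increase, t-statistic) for feature j.

    Assumption: feature j is replaced by a draw from a fitted Gaussian
    linear model of X_j given the other features. This is a simple
    stand-in, not the sequential knockoff sampler named in the abstract.
    """
    def design(X):
        # Prepend an intercept column.
        return np.column_stack([np.ones(len(X)), X])

    # Predictive model: ordinary least squares on all features.
    beta, *_ = np.linalg.lstsq(design(X_train), y_train, rcond=None)

    # Conditional model of feature j given the remaining features.
    others = [k for k in range(X_train.shape[1]) if k != j]
    gamma, *_ = np.linalg.lstsq(
        design(X_train[:, others]), X_train[:, j], rcond=None
    )
    resid = X_train[:, j] - design(X_train[:, others]) @ gamma
    resid_sd = resid.std(ddof=len(gamma))

    # Replace feature j in the test set by a conditional draw.
    X_tilde = X_test.copy()
    X_tilde[:, j] = design(X_test[:, others]) @ gamma + rng.normal(
        0.0, resid_sd, len(X_test)
    )

    # Per-sample increase in squared error, then a paired t-statistic.
    loss_orig = (y_test - design(X_test) @ beta) ** 2
    loss_tilde = (y_test - design(X_tilde) @ beta) ** 2
    delta = loss_tilde - loss_orig
    t_stat = delta.mean() / (delta.std(ddof=1) / np.sqrt(len(delta)))
    return delta.mean(), t_stat


# Simulated data: y depends on x1 only, while x2 is strongly correlated with x1.
rng = np.random.default_rng(0)
n = 4000
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + np.sqrt(1 - 0.9**2) * rng.normal(size=n)
y = x1 + rng.normal(size=n)
X = np.column_stack([x1, x2])

half = n // 2
X_train, X_test = X[:half], X[half:]
y_train, y_test = y[:half], y[half:]

# Marginal view: x2 looks relevant because it is a proxy for x1.
marginal_corr_x2 = np.corrcoef(x2, y)[0, 1]

# Conditional view: loss increase when each feature is resampled given the other.
cpi_x1, t_x1 = cpi_style_test(X_train, y_train, X_test, y_test, j=0, rng=rng)
cpi_x2, t_x2 = cpi_style_test(X_train, y_train, X_test, y_test, j=1, rng=rng)

print(f"marginal corr(x2, y): {marginal_corr_x2:.3f}")
print(f"x1: mean loss increase {cpi_x1:.3f}, t = {t_x1:.2f}")
print(f"x2: mean loss increase {cpi_x2:.3f}, t = {t_x2:.2f}")
```

In this setup the marginal correlation between `x2` and `y` is substantial, while the conditional loss increase for `x2` stays close to zero and the one for `x1` is clearly positive. Handling categorical features, as the abstract proposes, would require a sampler designed for mixed data in place of the Gaussian step assumed here.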
