Paper Title


Confound-leakage: Confound Removal in Machine Learning Leads to Leakage

Authors

Hamdan, Sami, Love, Bradley C., von Polier, Georg G., Weis, Susanne, Schwender, Holger, Eickhoff, Simon B., Patil, Kaustubh R.

Abstract


Machine learning (ML) approaches to data analysis are now widely adopted in many fields including epidemiology and medicine. To apply these approaches, confounds must first be removed as is commonly done by featurewise removal of their variance by linear regression before applying ML. Here, we show this common approach to confound removal biases ML models, leading to misleading results. Specifically, this common deconfounding approach can leak information such that what are null or moderate effects become amplified to near-perfect prediction when nonlinear ML approaches are subsequently applied. We identify and evaluate possible mechanisms for such confound-leakage and provide practical guidance to mitigate its negative impact. We demonstrate the real-world importance of confound-leakage by analyzing a clinical dataset where accuracy is overestimated for predicting attention deficit hyperactivity disorder (ADHD) with depression as a confound. Our results have wide-reaching implications for implementation and deployment of ML workflows and beg caution against naïve use of standard confound removal approaches.
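The abstract refers to "featurewise removal of their variance by linear regression": each feature is regressed on the confound and replaced by the residuals before ML is applied. The sketch below illustrates that standard deconfounding step under simple assumptions (a single continuous confound, NumPy arrays, scikit-learn's `LinearRegression`); the helper name `remove_confound` is hypothetical, and the paper's point is precisely that applying nonlinear ML to such residuals can still leak confound information.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def remove_confound(X, confound):
    """Featurewise confound removal: regress each feature on the
    confound with ordinary least squares and keep the residuals."""
    confound = np.asarray(confound, dtype=float).reshape(-1, 1)
    X = np.asarray(X, dtype=float)
    X_clean = np.empty_like(X)
    for j in range(X.shape[1]):
        model = LinearRegression().fit(confound, X[:, j])
        X_clean[:, j] = X[:, j] - model.predict(confound)
    return X_clean

# Toy data: feature 0 is driven entirely by the confound,
# feature 1 is independent noise.
rng = np.random.default_rng(0)
c = rng.normal(size=100)
X = np.column_stack([2.0 * c, rng.normal(size=100)])

X_clean = remove_confound(X, c)
```

By construction, OLS residuals are linearly uncorrelated with the confound, so the cleaned features carry no *linear* confound signal; the paper shows that nonlinear learners applied downstream can nevertheless recover confound-related structure from these residuals.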
