Paper Title
A Numerical Transform of Random Forest Regressors corrects Systematically-Biased Predictions
Paper Authors
Paper Abstract
Over the past decade, random forest models have become widely used as a robust method for high-dimensional data regression tasks. In part, the popularity of these models arises from the fact that they require little hyperparameter tuning and are not very susceptible to overfitting. Random forest regression models consist of an ensemble of decision trees that independently predict the value of a (continuous) dependent variable; predictions from each of the trees are ultimately averaged to yield an overall predicted value from the forest. Using a suite of representative real-world datasets, we find a systematic bias in predictions from random forest models. We find that this bias is recapitulated in simple synthetic datasets, whether or not they include irreducible error (noise) in the data, but that models employing boosting do not exhibit this bias. Here we demonstrate the basis for this problem, and we use the training data to define a numerical transformation that fully corrects it. Application of this transformation yields improved predictions in every one of the real-world and synthetic datasets evaluated in our study.