Paper Title

Asymptotic Properties of High-Dimensional Random Forests

Paper Authors

Chien-Ming Chi, Patrick Vossler, Yingying Fan, Jinchi Lv

Paper Abstract

As a flexible nonparametric learning tool, the random forests algorithm has been widely applied to various real applications with appealing empirical performance, even in the presence of a high-dimensional feature space. Unveiling the underlying mechanisms has led to some important recent theoretical results on the consistency of the random forests algorithm and its variants. However, to our knowledge, almost all existing works concerning random forests consistency in the high-dimensional setting were established for various modified random forests models in which the splitting rules are independent of the response; a few exceptions assume simple data-generating models with binary features. In light of this, in this paper we derive the consistency rates for the random forests algorithm associated with the sample CART splitting criterion, which is the one used in the original version of the algorithm, in a general high-dimensional nonparametric regression setting through a bias-variance decomposition analysis. Our new theoretical results show that random forests can indeed adapt to high dimensionality and allow for discontinuous regression functions. Our bias analysis characterizes explicitly how the random forests bias depends on the sample size, tree height, and column subsampling parameter. Some limitations of our current results are also discussed.
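For readers who want a concrete picture of the object the theory studies, below is a minimal Python sketch (not the authors' code) of the sample CART splitting criterion for regression: a single split is chosen by minimizing the within-child sum of squared errors over a, possibly column-subsampled, set of candidate features. The function name sample_cart_split, the feature_subset argument, and the toy step-function data are our own illustrative assumptions.

```python
import numpy as np

def sample_cart_split(X, y, feature_subset=None):
    """Return the (feature, threshold, sse) minimizing the within-child
    sum of squared errors -- the sample CART criterion for regression.

    feature_subset mimics column subsampling: only the listed feature
    indices are considered as split candidates (None means all features).
    """
    n, p = X.shape
    features = range(p) if feature_subset is None else feature_subset
    best = (None, None, np.inf)  # (feature index, threshold, sse)
    for j in features:
        order = np.argsort(X[:, j])
        xj, yj = X[order, j], y[order]
        for k in range(1, n):
            if xj[k] == xj[k - 1]:
                continue  # no valid threshold between tied feature values
            left, right = yj[:k], yj[k:]
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best[2]:
                best = (j, 0.5 * (xj[k - 1] + xj[k]), sse)
    return best

# Toy usage: a discontinuous (step) regression function, echoing the
# abstract's point that random forests can handle discontinuities.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 10))        # 10 features, only the first is relevant
y = (X[:, 0] > 0.5).astype(float) + 0.1 * rng.standard_normal(200)
print(sample_cart_split(X, y))         # typically selects feature 0 with a threshold near 0.5
```

In off-the-shelf implementations such as scikit-learn's RandomForestRegressor, the column subsampling parameter and tree height that appear in the paper's bias analysis correspond roughly to the max_features and max_depth arguments.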
