Paper Title

Error bounds in estimating the out-of-sample prediction error using leave-one-out cross validation in high-dimensions

Paper Authors

Kamiar Rahnama Rad, Wenda Zhou, Arian Maleki

Abstract

We study the problem of out-of-sample risk estimation in the high dimensional regime where both the sample size $n$ and number of features $p$ are large, and $n/p$ can be less than one. Extensive empirical evidence confirms the accuracy of leave-one-out cross validation (LO) for out-of-sample risk estimation. Yet, a unifying theoretical evaluation of the accuracy of LO in high-dimensional problems has remained an open problem. This paper aims to fill this gap for penalized regression in the generalized linear family. With minor assumptions about the data generating process, and without any sparsity assumptions on the regression coefficients, our theoretical analysis obtains finite sample upper bounds on the expected squared error of LO in estimating the out-of-sample error. Our bounds show that the error goes to zero as $n,p \rightarrow \infty$, even when the dimension $p$ of the feature vectors is comparable with or greater than the sample size $n$. One technical advantage of the theory is that it can be used to clarify and connect some results from the recent literature on scalable approximate LO.
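
To make the object of study concrete, here is a minimal sketch (not the authors' code) of leave-one-out risk estimation for ridge regression, one member of the penalized generalized linear family, in the proportional regime where $n$ and $p$ are comparable. The closed-form leverage shortcut in the second half is a standard exact identity for ridge and only illustrates the flavor of the "scalable approximate LO" literature the abstract mentions; the Gaussian design, the choice $\lambda = 1$, and all variable names are illustrative assumptions.

```python
# Sketch: naive leave-one-out (LO) vs. a closed-form shortcut for ridge
# regression with n = p (high-dimensional regime, dense coefficients).
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 300, 300, 1.0                 # n/p = 1; lam is an assumed penalty
X = rng.standard_normal((n, p)) / np.sqrt(p)
beta = rng.standard_normal(p)             # dense: no sparsity assumed
y = X @ beta + 0.5 * rng.standard_normal(n)

def loo_naive(X, y, lam):
    """Refit n times, leaving out one observation each time."""
    n, p = X.shape
    errs = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        Xi, yi = X[mask], y[mask]
        b = np.linalg.solve(Xi.T @ Xi + lam * np.eye(p), Xi.T @ yi)
        errs[i] = (y[i] - X[i] @ b) ** 2
    return errs.mean()

# Closed-form LO for ridge: (y_i - yhat_i)^2 / (1 - H_ii)^2 from a single
# fit, where H = X (X^T X + lam I)^{-1} X^T is the penalized hat matrix.
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
resid = y - H @ y
loo_fast = np.mean((resid / (1.0 - np.diag(H))) ** 2)

print(loo_naive(X, y, lam), loo_fast)     # the two estimates agree
```

For ridge the shortcut is exact, so one fit replaces $n$ refits; for general penalized GLMs no such exact identity exists, which is where approximate LO methods, and the error bounds this paper proves, come in.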
