Paper Title
Approximate Cross-Validation with Low-Rank Data in High Dimensions
Paper Authors
Paper Abstract
Many recent advances in machine learning are driven by a challenging trifecta: large data size $N$; high dimensions; and expensive algorithms. In this setting, cross-validation (CV) serves as an important tool for model assessment. Recent advances in approximate cross-validation (ACV) provide accurate approximations to CV with only a single model fit, avoiding traditional CV's requirement for repeated runs of expensive algorithms. Unfortunately, these ACV methods can lose both speed and accuracy in high dimensions -- unless sparsity structure is present in the data. Fortunately, there is an alternative type of simplifying structure that is present in most data: approximate low rank (ALR). Guided by this observation, we develop a new algorithm for ACV that is fast and accurate in the presence of ALR data. Our first key insight is that the Hessian matrix -- whose inverse forms the computational bottleneck of existing ACV methods -- is ALR. We show that, despite our use of the \emph{inverse} Hessian, a low-rank approximation using the largest (rather than the smallest) matrix eigenvalues enables fast, reliable ACV. Our second key insight is that, in the presence of ALR data, error in existing ACV methods roughly grows with the (approximate, low) rank rather than with the (full, high) dimension. These insights allow us to prove theoretical guarantees on the quality of our proposed algorithm -- along with fast-to-compute upper bounds on its error. We demonstrate the speed and accuracy of our method, as well as the usefulness of our bounds, on a range of real and simulated data sets.
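To make the core idea concrete, the following is a minimal sketch (not the paper's actual algorithm) of how a low-rank approximation built from the *largest* Hessian eigenvalues can still serve the *inverse*-Hessian products that ACV needs. It assumes ridge regression, where the Hessian is $H = X^\top X / N + \lambda I$: keeping only the top-$k$ eigenpairs of the data term, the remaining eigendirections are handled exactly by the $\lambda I$ regularizer, so $H^{-1} \approx V \,\mathrm{diag}\big(1/(s_j+\lambda)\big) V^\top + \lambda^{-1}(I - V V^\top)$. The function name and structure are illustrative choices, not the paper's API; the leave-one-out step is the standard Newton-step (Sherman-Morrison) ACV update.

```python
import numpy as np

def acv_loo_ridge(X, y, lam, k):
    """Illustrative sketch: approximate leave-one-out CV for ridge
    regression, using a rank-k approximation of the inverse Hessian.

    Objective: (1/(2n)) * ||X @ theta - y||^2 + (lam/2) * ||theta||^2,
    so the Hessian is H = X.T @ X / n + lam * I. We keep the top-k
    eigenpairs of X.T @ X / n; directions outside the top-k subspace
    are covered exactly by the lam * I term.
    """
    n, d = X.shape
    # Full-data fit (a real implementation would use an iterative solver).
    theta = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

    # Top-k eigenpairs of X.T @ X / n via the SVD of X: eigenvalues are
    # s^2 / n with eigenvectors in the rows of Vt.
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt[:k].T                    # (d, k): top-k eigenvectors
    eig = (s[:k] ** 2) / n          # top-k eigenvalues of the data term

    def apply_Hinv(g):
        # Low-rank inverse-Hessian-vector product:
        # exact on span(V), 1/lam on its orthogonal complement.
        coef = V.T @ g
        return V @ (coef / (eig + lam)) + (g - V @ coef) / lam

    residual = X @ theta - y
    loo_preds = np.empty(n)
    for i in range(n):
        g_i = residual[i] * X[i] / n          # gradient of point i's loss
        # One Newton step toward the leave-one-out optimum; the
        # denominator is the Sherman-Morrison correction for removing
        # the rank-one term x_i x_i^T / n from the Hessian.
        denom = 1.0 - (X[i] @ apply_Hinv(X[i])) / n
        theta_i = theta + apply_Hinv(g_i) / denom
        loo_preds[i] = X[i] @ theta_i
    return loo_preds
```

Because the ridge loss is quadratic, one Newton step is exact, so with `k = d` this recovers exact leave-one-out predictions; with `k < d` on ALR data, the abstract's claim is that the extra error scales with the approximate rank rather than the full dimension.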