Paper Title
Low Rank Saddle Free Newton: A Scalable Method for Stochastic Nonconvex Optimization
Paper Authors
Paper Abstract
In modern deep learning, highly subsampled stochastic approximation (SA) methods are preferred to sample average approximation (SAA) methods because of large data sets as well as generalization properties. Additionally, due to the perceived costs of forming and factorizing Hessians, second order methods are not used for these problems. In this work we motivate the extension of Newton methods to the SA regime, and argue for the use of the scalable low rank saddle free Newton (LRSFN) method, which avoids forming the Hessian in favor of making a low rank approximation. Additionally, LRSFN can facilitate fast escape from indefinite regions, leading to better optimization solutions. In the SA setting, iterative updates are dominated by stochastic noise, and the stability of the method is key. We introduce a continuous time stability analysis framework, and use it to demonstrate that stochastic errors in Newton methods can be greatly amplified by ill-conditioned Hessians. The LRSFN method mitigates this stability issue via Levenberg-Marquardt damping. More generally, however, the analysis shows that second order methods with stochastic Hessian and gradient information may need to take small steps, unlike in deterministic problems. Numerical results show that LRSFN can escape indefinite regions where other methods struggle, and that even under restrictive step length conditions, LRSFN can outperform popular first order methods on large scale deep learning tasks in terms of generalizability for equivalent computational work.
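The update the abstract describes is easiest to see in a short sketch. Saddle free Newton replaces the Hessian's eigenvalues with their absolute values, and LRSFN works with a rank-r eigenvalue decomposition H ≈ U Λ Uᵀ plus Levenberg-Marquardt damping γ, so one step solves (U|Λ|Uᵀ + γ I) p = −g using only Hessian-vector products. The following is a minimal sketch under those assumptions, not the authors' implementation: the function name lrsfn_step, the use of SciPy's Lanczos-based eigsh for the low rank factorization, and the closed-form inverse via the eigenbasis split are illustrative choices.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

def lrsfn_step(grad, hvp, r, gamma):
    """One low rank saddle free Newton step (illustrative sketch).

    grad  : ndarray (n,), stochastic gradient
    hvp   : callable v -> Hv, stochastic Hessian-vector product
    r     : rank of the Hessian approximation (r << n)
    gamma : Levenberg-Marquardt damping parameter
    """
    n = grad.shape[0]
    H = LinearOperator((n, n), matvec=hvp)
    # Top-r eigenpairs by magnitude via matrix-free Lanczos iteration;
    # the Hessian is never formed, only applied to vectors.
    lam, U = eigsh(H, k=r, which="LM")
    lam_abs = np.abs(lam)  # saddle free: use |lambda_i|
    # Apply (U|Lambda|U^T + gamma I)^{-1} g exactly: the operator acts as
    # 1/(|lambda_i| + gamma) on span(U) and as 1/gamma on its complement.
    UTg = U.T @ grad
    step = U @ (UTg / (lam_abs + gamma)) + (grad - U @ UTg) / gamma
    return -step

# Toy usage on an indefinite quadratic f(x) = 0.5 x^T A x + b^T x,
# where the exact Hessian A has both positive and negative eigenvalues.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
A = 0.5 * (A + A.T)  # symmetric indefinite
b = rng.standard_normal(50)
x = rng.standard_normal(50)
p = lrsfn_step(A @ x + b, lambda v: A @ v, r=10, gamma=1e-1)
x = x + p  # unit step here; per the abstract, step lengths may need to be small
```

Because |λ| + γ is strictly positive, the computed direction is always a descent-capable rescaling of the gradient, even at saddle points where a plain Newton step would be attracted to the stationary point; the damping γ also bounds the amplification 1/γ applied to gradient components outside the rank-r subspace, which is the stability concern the abstract raises for ill-conditioned Hessians.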