Paper Title
Practical Quasi-Newton Methods for Training Deep Neural Networks
Paper Authors
Paper Abstract
We consider the development of practical stochastic quasi-Newton, and in particular Kronecker-factored block-diagonal BFGS and L-BFGS, methods for training deep neural networks (DNNs). In DNN training, the number of variables and of components of the gradient, $n$, is often of the order of tens of millions, and the Hessian has $n^2$ elements. Consequently, computing and storing a full $n \times n$ BFGS approximation, or storing a modest number of (step, change-in-gradient) vector pairs for use in an L-BFGS implementation, is out of the question. In our proposed methods, we approximate the Hessian by a block-diagonal matrix and use the structure of the gradient and Hessian to further approximate these blocks, each of which corresponds to a layer, as the Kronecker product of two much smaller matrices. This is analogous to the approach in KFAC, which computes a Kronecker-factored block-diagonal approximation to the Fisher matrix in a stochastic natural gradient method. Because of the indefinite and highly variable nature of the Hessian in a DNN, we also propose a new damping approach to keep the BFGS and L-BFGS approximations bounded both above and below. In tests on autoencoder feed-forward neural network models with either nine or thirteen layers applied to three datasets, our methods outperformed or performed comparably to KFAC and state-of-the-art first-order stochastic methods.
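To make the two main ingredients of the abstract concrete, the sketch below is a minimal NumPy illustration, not the authors' implementation: (i) a standard Powell-style damping of a BFGS curvature pair, one common safeguard in the spirit of the damping described above for keeping the approximation positive definite and well conditioned, and (ii) how a Kronecker-factored block preconditioner $A \otimes G$ can be applied to a single layer's gradient without ever forming the full block. The function names, the damping constant `mu`, and the factors `A_inv`/`G_inv` are illustrative assumptions, not quantities defined in the paper.

```python
import numpy as np


def powell_damped_pair(s, y, B, mu=0.2):
    """Powell-style damping of a curvature pair (s, y) before a BFGS update.

    If the curvature s'y is too small relative to s'Bs (B assumed positive
    definite), blend y with Bs so the updated approximation stays positive
    definite and does not become ill conditioned. `mu` is an illustrative
    damping constant, not a value taken from the paper.
    """
    Bs = B @ s
    sBs = s @ Bs
    sy = s @ y
    if sy < mu * sBs:
        theta = (1.0 - mu) * sBs / (sBs - sy)
        y = theta * y + (1.0 - theta) * Bs
    return y


def bfgs_update(B, s, y):
    """Standard BFGS update of a Hessian approximation B with pair (s, y)."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (s @ y)


def kron_block_precondition(dW, A_inv, G_inv):
    """Apply a Kronecker-factored block preconditioner to one layer's gradient.

    If the layer's Hessian block is approximated by A ⊗ G, the vec identity
    (A ⊗ G)^{-1} vec(dW) = vec(G^{-1} dW A^{-1}) lets us precondition the
    gradient matrix dW (shape d_out x d_in) using only the two small factors.
    """
    return G_inv @ dW @ A_inv


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 5
    B = np.eye(n)
    s = rng.standard_normal(n)
    y = rng.standard_normal(n)        # in a DNN, s'y may be small or negative
    y = powell_damped_pair(s, y, B)   # damping restores usable curvature
    B = bfgs_update(B, s, y)
```

The vec identity used in `kron_block_precondition` is what makes per-layer quasi-Newton preconditioning affordable: the cost depends only on the sizes of the two small factors, never on the full $n \times n$ block.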