Paper Title

Provable Acceleration of Nesterov's Accelerated Gradient Method over Heavy Ball Method in Training Over-Parameterized Neural Networks

Authors

Xin Liu, Wei Tao, Wei Li, Dazhi Zhan, Jun Wang, Zhisong Pan

Abstract

Due to its simplicity and efficiency, the first-order gradient method has been extensively employed in training neural networks. Although the optimization problem of the neural network is non-convex, recent research has proved that the first-order method is capable of attaining a global minimum when training over-parameterized neural networks, where the number of parameters is significantly larger than the number of training instances. Momentum methods, including the heavy ball (HB) method and Nesterov's accelerated gradient (NAG) method, are the workhorses of first-order gradient methods owing to their accelerated convergence. In practice, NAG often exhibits performance superior to that of HB. However, current theoretical works fail to distinguish their convergence difference in training neural networks. To fill this gap, we consider the training problem of the two-layer ReLU neural network under over-parameterization and random initialization. Leveraging high-resolution dynamical systems and neural tangent kernel (NTK) theory, our results not only establish tighter upper bounds on the convergence rates of both HB and NAG, but also provide the first theoretical guarantee for the acceleration of NAG over HB in training neural networks. Finally, we validate our theoretical results on three benchmark datasets.
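
For readers unfamiliar with the two momentum schemes the abstract compares, the following minimal Python sketch contrasts a single HB step with a single NAG step on a toy least-squares objective. The quadratic loss, step size eta, and momentum beta are illustrative assumptions for exposition only, not the paper's neural-network setting or analysis; the only point is that NAG evaluates the gradient at an extrapolated look-ahead point, while HB evaluates it at the current iterate.

import numpy as np

def grad(w, A, b):
    # Gradient of the toy quadratic loss 0.5 * ||A w - b||^2.
    return A.T @ (A @ w - b)

def heavy_ball_step(w, w_prev, A, b, eta=0.01, beta=0.9):
    # HB: gradient evaluated at the current iterate w_t, plus a momentum term.
    return w - eta * grad(w, A, b) + beta * (w - w_prev)

def nag_step(w, w_prev, A, b, eta=0.01, beta=0.9):
    # NAG: gradient evaluated at the extrapolated (look-ahead) point y_t.
    y = w + beta * (w - w_prev)
    return y - eta * grad(y, A, b)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
    w_hb = w_prev_hb = np.zeros(5)
    w_nag = w_prev_nag = np.zeros(5)
    for _ in range(200):
        w_hb, w_prev_hb = heavy_ball_step(w_hb, w_prev_hb, A, b), w_hb
        w_nag, w_prev_nag = nag_step(w_nag, w_prev_nag, A, b), w_nag
    print("HB loss :", 0.5 * np.linalg.norm(A @ w_hb - b) ** 2)
    print("NAG loss:", 0.5 * np.linalg.norm(A @ w_nag - b) ** 2)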
