Paper Title

NAG-GS: Semi-Implicit, Accelerated and Robust Stochastic Optimizer

Paper Authors

Valentin Leplat, Daniil Merkulov, Aleksandr Katrutsa, Daniel Bershatsky, Olga Tsymboi, Ivan Oseledets

Paper Abstract

Classical machine learning models such as deep neural networks are usually trained by using Stochastic Gradient Descent-based (SGD) algorithms. The classical SGD can be interpreted as a discretization of the stochastic gradient flow. In this paper, we propose a novel, robust and accelerated stochastic optimizer that relies on two key elements: (1) an accelerated Nesterov-like Stochastic Differential Equation (SDE) and (2) its semi-implicit Gauss-Seidel type discretization. The convergence and stability of the obtained method, referred to as NAG-GS, are first studied extensively in the case of the minimization of a quadratic function. This analysis allows us to come up with an optimal learning rate in terms of the convergence rate while ensuring the stability of NAG-GS. This is achieved by the careful analysis of the spectral radius of the iteration matrix and the covariance matrix at stationarity with respect to all hyperparameters of our method. Further, we show that NAG-GS is competitive with state-of-the-art methods such as momentum SGD with weight decay and AdamW for the training of machine learning models such as the logistic regression model, residual network models on standard computer vision datasets, Transformers in the frame of the GLUE benchmark, and the recent Vision Transformers.
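
To illustrate the two ingredients named in the abstract, the following is a minimal sketch of a Nesterov-like accelerated SDE and a semi-implicit Gauss-Seidel step for it. It is not the paper's exact formulation: the coefficients \alpha, \beta, \gamma, the noise scale \sigma, and the step size h are illustrative assumptions. The continuous-time system couples the iterate X_t with an auxiliary momentum-like variable V_t:

\begin{aligned}
dX_t &= \alpha\,(V_t - X_t)\,dt, \\
dV_t &= \beta\,(X_t - V_t)\,dt - \gamma\,\nabla f(X_t)\,dt + \sigma\,dW_t.
\end{aligned}

A Gauss-Seidel type discretization updates the two blocks sequentially, treating the linear coupling terms implicitly (semi-implicitly) and reusing the freshly computed iterate in the second block:

\begin{aligned}
x_{k+1} &= x_k + h\alpha\,(v_k - x_{k+1}) \;\Longrightarrow\; x_{k+1} = \frac{x_k + h\alpha\,v_k}{1 + h\alpha}, \\
v_{k+1} &= v_k + h\big(\beta\,(x_{k+1} - v_{k+1}) - \gamma\,\nabla f(x_{k+1})\big) + \sqrt{h}\,\sigma\,\xi_k, \qquad \xi_k \sim \mathcal{N}(0, I).
\end{aligned}

In this sketch, the implicit treatment of the linear terms is what makes the scheme amenable to the stability analysis mentioned in the abstract: for a quadratic objective f(x) = \tfrac{1}{2} x^\top A x, the pair (x_k, v_k) evolves by a fixed iteration matrix whose spectral radius, together with the stationary covariance, determines admissible step sizes.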
