Paper Title
On the SDEs and Scaling Rules for Adaptive Gradient Algorithms
Paper Authors
Paper Abstract
Approximating Stochastic Gradient Descent (SGD) as a Stochastic Differential Equation (SDE) has allowed researchers to enjoy the benefits of studying a continuous optimization trajectory while carefully preserving the stochasticity of SGD. Analogous study of adaptive gradient methods, such as RMSprop and Adam, has been challenging because there were no rigorously proven SDE approximations for these methods. This paper derives the SDE approximations for RMSprop and Adam, giving theoretical guarantees of their correctness as well as experimental validation of their applicability to common large-scale vision and language settings. A key practical result is the derivation of a $\textit{square root scaling rule}$ to adjust the optimization hyperparameters of RMSprop and Adam when changing batch size, and its empirical validation in deep learning settings.
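The abstract does not spell out the exact form of the square root scaling rule, so the following is only a minimal sketch of how such a rule could be applied in practice. The assumptions are: the learning rate is multiplied by the square root of the batch-size ratio (as the rule's name suggests), while the rescaling of $1-\beta_1$, $1-\beta_2$, and $\epsilon$ shown here is our own illustrative assumption, not quoted from the paper. The function name `scale_adam_hyperparams` and its signature are hypothetical.

```python
import math

def scale_adam_hyperparams(lr, betas, eps, base_batch, new_batch):
    """Rescale Adam/RMSprop hyperparameters for a new batch size.

    Sketch of a square-root-style scaling rule: multiplying the batch size by
    kappa multiplies the learning rate by sqrt(kappa). The treatment of the
    momentum coefficients and epsilon below is an assumption for illustration.
    """
    kappa = new_batch / base_batch
    new_lr = lr * math.sqrt(kappa)            # square root scaling of the step size
    beta1, beta2 = betas
    new_beta1 = 1.0 - kappa * (1.0 - beta1)   # assumed: 1 - beta scales linearly with kappa
    new_beta2 = 1.0 - kappa * (1.0 - beta2)
    new_eps = eps / math.sqrt(kappa)          # assumed: epsilon shrinks with sqrt(kappa)
    return new_lr, (new_beta1, new_beta2), new_eps

# Example: moving from batch size 256 to 1024 (kappa = 4) with typical Adam defaults.
print(scale_adam_hyperparams(1e-3, (0.9, 0.999), 1e-8, base_batch=256, new_batch=1024))
```

For the precise statement of the rule and its theoretical justification via the SDE approximations, refer to the paper itself.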