Paper Title

Fractional Underdamped Langevin Dynamics: Retargeting SGD with Momentum under Heavy-Tailed Gradient Noise

Paper Authors

Umut Şimşekli, Lingjiong Zhu, Yee Whye Teh, Mert Gürbüzbalaban

Abstract

Stochastic gradient descent with momentum (SGDm) is one of the most popular optimization algorithms in deep learning. While there is a rich theory of SGDm for convex problems, the theory is considerably less developed in the context of deep learning, where the problem is non-convex and the gradient noise might exhibit a heavy-tailed behavior, as empirically observed in recent studies. In this study, we consider a \emph{continuous-time} variant of SGDm, known as the underdamped Langevin dynamics (ULD), and investigate its asymptotic properties under heavy-tailed perturbations. Supported by recent studies from statistical physics, we argue both theoretically and empirically that the heavy-tails of such perturbations can result in a bias even when the step-size is small, in the sense that \emph{the optima of the stationary distribution} of the dynamics might not match \emph{the optima of the cost function to be optimized}. As a remedy, we develop a novel framework, which we coin as \emph{fractional} ULD (FULD), and prove that FULD targets the so-called Gibbs distribution, whose optima exactly match the optima of the original cost. We observe that the Euler discretization of FULD has noteworthy algorithmic similarities with \emph{natural gradient} methods and \emph{gradient clipping}, bringing a new perspective on understanding their role in deep learning. We support our theory with experiments conducted on a synthetic model and neural networks.
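To make the setting concrete, the following is a minimal illustrative sketch (not the paper's FULD algorithm) of an Euler discretization of underdamped Langevin dynamics, i.e. an SGDm-like update, driven by symmetric α-stable noise; α = 2 recovers the Gaussian case, while α < 2 gives the heavy-tailed perturbations the abstract refers to. All function names, step sizes, and parameter choices here are our own assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_alpha_stable(alpha, size, rng):
    """Symmetric alpha-stable sample via the Chambers-Mallows-Stuck method."""
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)
    w = rng.exponential(1.0, size)
    return (np.sin(alpha * u) / np.cos(u) ** (1 / alpha)
            * (np.cos(u - alpha * u) / w) ** ((1 - alpha) / alpha))

def uld_euler(grad, x0, steps=1000, eta=1e-3, gamma=1.0, alpha=1.7):
    """Euler discretization of underdamped Langevin dynamics:
    momentum v is damped by gamma, driven by -grad(x), and perturbed
    by alpha-stable noise scaled by eta**(1/alpha)."""
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(steps):
        noise = sample_alpha_stable(alpha, x.shape, rng)
        v = v - eta * gamma * v - eta * grad(x) + eta ** (1 / alpha) * noise
        x = x + eta * v
    return x

# Toy quadratic cost f(x) = 0.5 * ||x||^2, whose gradient is x.
x_final = uld_euler(lambda x: x, x0=[5.0, -3.0])
print(x_final)
```

With α < 2 the noise occasionally produces very large jumps, which is exactly the heavy-tailed regime in which the paper shows the stationary distribution can be biased away from the optima of the cost; the paper's FULD correction modifies this dynamics so that the Gibbs distribution is targeted instead.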
