Paper Title

Fractional Underdamped Langevin Dynamics: Retargeting SGD with Momentum under Heavy-Tailed Gradient Noise

Paper Authors

Umut Şimşekli, Lingjiong Zhu, Yee Whye Teh, Mert Gürbüzbalaban

Abstract

Stochastic gradient descent with momentum (SGDm) is one of the most popular optimization algorithms in deep learning. While there is a rich theory of SGDm for convex problems, the theory is considerably less developed in the context of deep learning, where the problem is non-convex and the gradient noise might exhibit a heavy-tailed behavior, as empirically observed in recent studies. In this study, we consider a \emph{continuous-time} variant of SGDm, known as the underdamped Langevin dynamics (ULD), and investigate its asymptotic properties under heavy-tailed perturbations. Supported by recent studies from statistical physics, we argue both theoretically and empirically that the heavy-tails of such perturbations can result in a bias even when the step-size is small, in the sense that \emph{the optima of the stationary distribution} of the dynamics might not match \emph{the optima of the cost function to be optimized}. As a remedy, we develop a novel framework, which we coin as \emph{fractional} ULD (FULD), and prove that FULD targets the so-called Gibbs distribution, whose optima exactly match the optima of the original cost. We observe that the Euler discretization of FULD has noteworthy algorithmic similarities with \emph{natural gradient} methods and \emph{gradient clipping}, bringing a new perspective on understanding their role in deep learning. We support our theory with experiments conducted on a synthetic model and neural networks.
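To make the setting concrete, the following is a minimal illustrative sketch (not the paper's FULD algorithm) of an Euler discretization of underdamped Langevin dynamics, i.e. an SGDm-like update, driven by symmetric α-stable noise; α = 2 recovers the Gaussian case, while α < 2 gives the heavy-tailed perturbations the abstract refers to. All function names, step sizes, and parameter choices here are our own assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_alpha_stable(alpha, size, rng):
    """Symmetric alpha-stable sample via the Chambers-Mallows-Stuck method."""
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)
    w = rng.exponential(1.0, size)
    return (np.sin(alpha * u) / np.cos(u) ** (1 / alpha)
            * (np.cos(u - alpha * u) / w) ** ((1 - alpha) / alpha))

def uld_euler(grad, x0, steps=1000, eta=1e-3, gamma=1.0, alpha=1.7):
    """Euler discretization of underdamped Langevin dynamics:
    momentum v is damped by gamma, driven by -grad(x), and perturbed
    by alpha-stable noise scaled by eta**(1/alpha)."""
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(steps):
        noise = sample_alpha_stable(alpha, x.shape, rng)
        v = v - eta * gamma * v - eta * grad(x) + eta ** (1 / alpha) * noise
        x = x + eta * v
    return x

# Toy quadratic cost f(x) = 0.5 * ||x||^2, whose gradient is x.
x_final = uld_euler(lambda x: x, x0=[5.0, -3.0])
print(x_final)
```

With α < 2 the noise occasionally produces very large jumps, which is exactly the heavy-tailed regime in which the paper shows the stationary distribution can be biased away from the optima of the cost; the paper's FULD correction modifies this dynamics so that the Gibbs distribution is targeted instead.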
