Paper title
META-STORM: Generalized Fully-Adaptive Variance Reduced SGD for Unbounded Functions
Authors
Abstract
We study the application of variance reduction (VR) techniques to general non-convex stochastic optimization problems. In this setting, the recent work STORM [Cutkosky-Orabona '19] overcomes the drawback of having to compute gradients of "mega-batches" that earlier VR methods rely on. STORM uses recursive momentum to achieve the VR effect and was later made fully adaptive in STORM+ [Levy et al., '21], where full adaptivity removes the need to know certain problem-specific parameters, such as the smoothness of the objective and bounds on the variance and norm of the stochastic gradients, in order to set the step size. However, STORM+ crucially relies on the assumption that the function values are bounded, which excludes a large class of useful functions. In this work, we propose META-STORM, a generalized framework of STORM+ that removes this bounded-function-value assumption while still attaining the optimal convergence rate for non-convex optimization. META-STORM not only maintains full adaptivity, removing the need to obtain problem-specific parameters, but also improves the convergence rate's dependency on the problem parameters. Furthermore, META-STORM can utilize a wide range of parameter settings that subsumes previous methods, allowing for more flexibility across settings. Finally, we demonstrate the effectiveness of META-STORM through experiments across common deep learning tasks. Our algorithm improves upon the previous work STORM+ and is competitive with widely used algorithms after the addition of per-coordinate update and exponential moving average heuristics.
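To make the "recursive momentum" idea concrete, here is a minimal sketch of the STORM-style gradient estimator from [Cutkosky-Orabona '19], d_t = g(x_t; ξ_t) + (1 - a_t)(d_{t-1} - g(x_{t-1}; ξ_t)), where the same sample ξ_t is evaluated at both the current and the previous iterate. It is only an illustration: the toy objective f(x) = 0.5·x², the noise model, the function names (`stoch_grad`, `storm_step`, `run`), and the constant step size `lr` and momentum weight `a` are all assumptions made here. The adaptive step-size and momentum rules of STORM+ and META-STORM are not reproduced.

```python
import random


def stoch_grad(x, sample):
    """Noisy gradient of the toy objective f(x) = 0.5 * x**2.

    `sample` is a (multiplicative, additive) noise pair; both have mean zero,
    so the expectation of the returned value is the true gradient x.
    """
    mult, add = sample
    return (1.0 + mult) * x + add


def storm_step(grad_new, grad_old_same_sample, d_prev, a):
    """One recursive-momentum update of the gradient estimate.

    grad_new              stochastic gradient at the current iterate
    grad_old_same_sample  gradient at the previous iterate, SAME sample
    d_prev                previous estimate
    a                     momentum weight in [0, 1]; a = 1 gives plain SGD
    """
    return grad_new + (1.0 - a) * (d_prev - grad_old_same_sample)


def run(x0=5.0, lr=0.05, a=0.1, sigma=0.5, steps=500, seed=0):
    """Run `steps` updates with fixed lr and a (a simplification, not the paper's rule)."""
    rng = random.Random(seed)

    def draw():
        return (rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))

    x_prev = x0
    d = stoch_grad(x_prev, draw())  # the first estimate is a plain stochastic gradient
    x = x_prev - lr * d
    for _ in range(steps - 1):
        sample = draw()  # one fresh sample, evaluated at both iterates
        d = storm_step(stoch_grad(x, sample), stoch_grad(x_prev, sample), d, a)
        x_prev, x = x, x - lr * d
    return x


if __name__ == "__main__":
    print("final iterate:", run())
```

Each step needs only two gradient evaluations on a single sample, so no "mega-batch" is required. The correction term `d_prev - grad_old_same_sample` carries over the error of the previous estimate, damped by `1 - a`.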