Paper Title

Adaptive Learning Rates for Faster Stochastic Gradient Methods

Paper Authors

Samuel Horváth, Konstantin Mishchenko, Peter Richtárik

Paper Abstract

In this work, we propose new adaptive step size strategies that improve several stochastic gradient methods. Our first method (StoPS) is based on the classical Polyak step size (Polyak, 1987) and extends its recent development for stochastic optimization, SPS (Loizou et al., 2021); our second method, denoted GraDS, rescales the step size by the "diversity of stochastic gradients". We provide a theoretical analysis of these methods for strongly convex smooth functions and show that they enjoy deterministic-like rates despite using stochastic gradients. Furthermore, we demonstrate the theoretical superiority of our adaptive methods on quadratic objectives. Unfortunately, both StoPS and GraDS depend on unknown quantities, which are practical only for overparametrized models. To remedy this, we drop this undesired dependence and redefine StoPS and GraDS as StoP and GraD, respectively. We show that these new methods converge linearly to a neighbourhood of the optimal solution under the same assumptions. Finally, we corroborate our theoretical claims by experimental validation, which reveals that GraD is particularly useful for deep learning optimization.
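As a rough illustration of the kind of step size the abstract refers to, below is a minimal sketch of an SGD loop with an SPS-style Polyak step size on a toy overparametrized least-squares problem, where interpolation holds and each per-sample optimum f_i^* equals 0. The constants c and gamma_max and the synthetic data are assumptions chosen for illustration; the paper's actual StoPS and GraDS rules are defined in the paper itself and are not reproduced here.

```python
import numpy as np

# Sketch of an SPS-style Polyak step size (Loizou et al., 2021), which the
# paper's StoPS builds on. Assumes gamma_t = (f_i(x_t) - f_i^*) / (c * ||g_i||^2)
# with f_i^* = 0, valid here because the least-squares problem is overparametrized.

rng = np.random.default_rng(0)
n, d = 50, 100                      # more parameters than samples: interpolation holds
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true                      # exact solution exists, so each f_i^* = 0

def f_i(x, i):                      # per-sample loss f_i(x) = 0.5 * (a_i^T x - b_i)^2
    r = A[i] @ x - b[i]
    return 0.5 * r ** 2

def grad_i(x, i):                   # stochastic gradient of f_i
    return (A[i] @ x - b[i]) * A[i]

x = np.zeros(d)
c, gamma_max = 0.5, 10.0            # hypothetical constants; tune per problem
for t in range(2000):
    i = rng.integers(n)
    g = grad_i(x, i)
    g_norm2 = g @ g + 1e-12
    gamma = min(f_i(x, i) / (c * g_norm2), gamma_max)  # SPS-style adaptive step
    x -= gamma * g

print("final mean loss:", 0.5 * np.mean((A @ x - b) ** 2))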
