Paper Title

SGD for Structured Nonconvex Functions: Learning Rates, Minibatching and Interpolation

Authors

Gower, Robert M., Sebbouh, Othmane, Loizou, Nicolas

Abstract


Stochastic Gradient Descent (SGD) is routinely used for optimizing non-convex functions. Yet, the standard convergence theory for SGD in the smooth non-convex setting gives only slow sublinear convergence to a stationary point. In this work, we provide several convergence theorems for SGD showing convergence to a global minimum for non-convex problems satisfying some extra structural assumptions. In particular, we focus on two large classes of structured non-convex functions: (i) Quasar (Strongly) Convex functions (a generalization of convex functions) and (ii) functions satisfying the Polyak-Lojasiewicz condition (a generalization of strongly convex functions). Our analysis relies on an Expected Residual condition which we show is a strictly weaker assumption than previously used growth conditions, expected smoothness or bounded variance assumptions. We provide theoretical guarantees for the convergence of SGD for different step-size selections including constant, decreasing and the recently proposed stochastic Polyak step-size. In addition, all of our analysis holds for the arbitrary sampling paradigm, and as such, we give insights into the complexity of minibatching and determine an optimal minibatch size. Finally, we show that for models that interpolate the training data, we can dispense with our Expected Residual condition and give state-of-the-art results in this setting.
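For reference, the two classes of structured non-convex functions named in the abstract are commonly stated as follows (standard formulations from the literature; the exact constants and normalization used in the paper may differ). A differentiable function f with minimizer x^* and minimum f^* is \zeta-quasar convex for some \zeta \in (0, 1] if

    f^* \geq f(x) + \frac{1}{\zeta} \langle \nabla f(x), x^* - x \rangle \quad \text{for all } x,

and adding a \frac{\mu}{2} \|x^* - x\|^2 term to the right-hand side gives \zeta-quasar strong convexity. The function satisfies the Polyak-Lojasiewicz (PL) condition with constant \mu > 0 if

    \frac{1}{2} \|\nabla f(x)\|^2 \geq \mu \left( f(x) - f^* \right) \quad \text{for all } x.

The stochastic Polyak step-size mentioned in the abstract, for a sampled loss f_i with infimum f_i^* and a constant c > 0, takes the form

    \gamma_k = \frac{f_i(x^k) - f_i^*}{c \, \|\nabla f_i(x^k)\|^2}.

In the interpolation setting discussed at the end of the abstract, each f_i is itself minimized at x^*, so for non-negative losses f_i^* is commonly taken to be zero in practice.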
