Paper Title

A Simple Convergence Proof of Adam and Adagrad

Authors

Alexandre Défossez, Léon Bottou, Francis Bach, Nicolas Usunier

Abstract

We provide a simple proof of convergence covering both the Adam and Adagrad adaptive optimization algorithms when applied to smooth (possibly non-convex) objective functions with bounded gradients. We show that in expectation, the squared norm of the objective gradient averaged over the trajectory has an upper-bound which is explicit in the constants of the problem, parameters of the optimizer, the dimension $d$, and the total number of iterations $N$. This bound can be made arbitrarily small, and with the right hyper-parameters, Adam can be shown to converge with the same rate of convergence $O(d\ln(N)/\sqrt{N})$. When used with the default parameters, Adam doesn't converge, however, and just like constant step-size SGD, it moves away from the initialization point faster than Adagrad, which might explain its practical success. Finally, we obtain the tightest dependency on the heavy ball momentum decay rate $β_1$ among all previous convergence bounds for non-convex Adam and Adagrad, improving from $O((1-β_1)^{-3})$ to $O((1-β_1)^{-1})$.
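For readers who want the two updates side by side, here is a minimal NumPy sketch of the standard Adagrad and Adam iterations the abstract compares. The function names, default hyper-parameters (alpha, beta1, beta2, eps), and the placement of the epsilon term follow common deep-learning conventions and are illustrative only; they are not necessarily the exact variant analyzed in the paper.

```python
import numpy as np

def adagrad_step(theta, grad, state, alpha=0.01, eps=1e-10):
    # Accumulate the per-coordinate sum of squared gradients and scale
    # each coordinate's step by its square root (standard Adagrad update).
    state["sum_sq"] += grad ** 2
    return theta - alpha * grad / (np.sqrt(state["sum_sq"]) + eps)

def adam_step(theta, grad, state, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Heavy-ball momentum on the gradient (decay beta1) and an exponential
    # moving average of squared gradients (decay beta2), with the usual
    # bias corrections; t is the 1-based iteration counter.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** t)
    v_hat = state["v"] / (1 - beta2 ** t)
    return theta - alpha * m_hat / (np.sqrt(v_hat) + eps)

# Example: minimize f(x) = ||x||^2 / 2, whose gradient is x itself.
theta = np.ones(5)
state = {"m": np.zeros(5), "v": np.zeros(5)}
for t in range(1, 1001):
    theta = adam_step(theta, theta, state, t)
```

As the abstract notes, Adam run with these default constant hyper-parameters is not guaranteed to converge; the $O(d\ln(N)/\sqrt{N})$ rate is obtained only with appropriately chosen hyper-parameters.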
