Paper Title

Divergence Results and Convergence of a Variance Reduced Version of ADAM

Paper Authors

Ruiqi Wang, Diego Klabjan

Paper Abstract

Stochastic optimization algorithms that use exponential moving averages of past gradients, such as ADAM, RMSProp, and AdaGrad, have achieved great success in many applications, especially in training deep neural networks. ADAM in particular stands out as efficient and robust. Despite its outstanding performance, ADAM has been shown to diverge on some specific problems. We revisit the divergence question and provide examples of divergence under stronger conditions, such as in expectation or with high probability. Under a variance reduction assumption, we show that an ADAM-type algorithm converges, which implies that it is the variance of the gradients that causes the divergence of the original ADAM. To this end, we propose a variance reduced version of ADAM and provide a convergence analysis of the algorithm. Numerical experiments show that the proposed algorithm performs as well as ADAM. Our work suggests a new direction for fixing the convergence issues.
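
The abstract does not spell out the update rule, but the general idea of combining ADAM with variance reduced gradients can be sketched. The code below is an illustrative assumption on our part, not the authors' algorithm: it feeds an SVRG-style variance reduced gradient estimate into the standard ADAM update. All names (vr_adam, sample_grad, full_grad) and hyperparameter values are hypothetical.

```python
# Minimal sketch (assumed, not the paper's exact method): ADAM driven by an
# SVRG-style variance reduced gradient estimate instead of the raw stochastic gradient.
import numpy as np

def vr_adam(x0, sample_grad, full_grad, n_samples, n_epochs=10, inner_steps=100,
            lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, rng=None):
    """ADAM-type loop whose gradient input is variance reduced via an SVRG-style correction."""
    rng = rng or np.random.default_rng(0)
    x = x0.copy()
    m = np.zeros_like(x)          # first moment: EMA of gradients
    v = np.zeros_like(x)          # second moment: EMA of squared gradients
    t = 0
    for _ in range(n_epochs):
        snapshot = x.copy()
        mu = full_grad(snapshot)  # full gradient at the snapshot point
        for _ in range(inner_steps):
            t += 1
            i = rng.integers(n_samples)
            # SVRG-style estimator: unbiased, with variance shrinking as x nears the snapshot
            g = sample_grad(x, i) - sample_grad(snapshot, i) + mu
            m = beta1 * m + (1 - beta1) * g
            v = beta2 * v + (1 - beta2) * g * g
            m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
            v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
            x -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return x

# Toy usage: least squares, f(x) = (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2
A = np.random.default_rng(1).normal(size=(200, 5))
b = A @ np.ones(5)
sample_grad = lambda x, i: (A[i] @ x - b[i]) * A[i]
full_grad = lambda x: A.T @ (A @ x - b) / len(b)
x_star = vr_adam(np.zeros(5), sample_grad, full_grad, n_samples=len(b))
```

The variance reduction here comes from the SVRG-style correction term: as the iterate approaches the snapshot, the variance of the gradient estimate shrinks, which matches the kind of vanishing-variance condition under which the abstract states an ADAM-type algorithm converges.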
