论文标题
非平滑非凸函数的常数阶跃随机梯度下降的收敛性
Convergence of constant step stochastic gradient descent for non-smooth non-convex functions
论文作者
论文摘要
本文研究了恒定阶跃随机梯度下降的渐近行为,以最大程度地减少未知函数F(定义为对非凸,非平滑,局部Lipschitz随机函数的期望。由于梯度可能不存在,因此被某个操作员替换:一个合理的选择是使用随机函数的Clarke subdfientiation的元素;另一个选择是著名的反向传播算法的输出,该算法在执业者中很流行,并且最近由Bolte and Pauwels研究了其属性[7]。由于所选运算符的期望通常不是平均函数的Clarke细分BF的元素,因此在文献中假定BF的Oracle可用。作为第一个结果,在本文中表明,几乎所有算法的初始化点都不需要这样的甲骨文。接下来,在较小的步长制度中,显示出算法的插值轨迹以概率(在紧凑的收敛意义上)收敛到差分包含的解决方案集。最后,将迭代视为马尔可夫链,其过渡内核由步骤大小索引,这表明内核的不变分布薄弱地收敛到该差分包含的一组不变分布,因为步长大小趋于零。这些结果表明,当步长较小时,具有较大概率时,迭代次数最终位于平均函数临界点的附近。
This paper studies the asymptotic behavior of the constant step Stochastic Gradient Descent for the minimization of an unknown function F , defined as the expectation of a non convex, non smooth, locally Lipschitz random function. As the gradient may not exist, it is replaced by a certain operator: a reasonable choice is to use an element of the Clarke subdifferential of the random function; an other choice is the output of the celebrated backpropagation algorithm, which is popular amongst practionners, and whose properties have recently been studied by Bolte and Pauwels [7]. Since the expectation of the chosen operator is not in general an element of the Clarke subdifferential BF of the mean function, it has been assumed in the literature that an oracle of BF is available. As a first result, it is shown in this paper that such an oracle is not needed for almost all initialization points of the algorithm. Next, in the small step size regime, it is shown that the interpolated trajectory of the algorithm converges in probability (in the compact convergence sense) towards the set of solutions of the differential inclusion. Finally, viewing the iterates as a Markov chain whose transition kernel is indexed by the step size, it is shown that the invariant distribution of the kernel converge weakly to the set of invariant distribution of this differential inclusion as the step size tends to zero. These results show that when the step size is small, with large probability, the iterates eventually lie in a neighborhood of the critical points of the mean function F .