Paper Title

Large deviations rates for stochastic gradient descent with strongly convex functions

Paper Authors

Bajovic, Dragana; Jakovetic, Dusan; Kar, Soummya

Paper Abstract

Recent works have shown that high probability metrics for stochastic gradient descent (SGD) are informative and, in some cases, advantageous over the commonly adopted mean-square-error-based ones. In this work we provide a formal framework for the study of general high probability bounds with SGD, based on the theory of large deviations. The framework allows for a generic (not necessarily bounded) gradient noise satisfying mild technical assumptions, and allows the noise distribution to depend on the current iterate. Under the preceding assumptions, we find an upper large deviations bound for SGD with strongly convex functions. The corresponding rate function captures analytical dependence on the noise distribution and other problem parameters. This is in contrast with conventional mean-square error analysis, which captures the noise dependence only through the variance and does not capture the effect of higher-order moments or the interplay between the noise geometry and the shape of the cost function. We also derive exact large deviation rates for the case when the objective function is quadratic and show that the obtained rate function matches the one from the general upper bound, hence showing the tightness of the general upper bound. Numerical examples illustrate and corroborate the theoretical findings.
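
Below is a minimal, illustrative sketch (not the authors' code) of the kind of high-probability metric discussed in the abstract: SGD is run on a simple strongly convex quadratic with additive Gaussian gradient noise, and the tail probability P(||x_k - x*|| >= eps) is estimated empirically over many independent runs. All names and parameter choices here (A, noise_std, eps, the step-size schedule) are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch (assumed parameters, not the authors' setup): SGD on a
# strongly convex quadratic f(x) = 0.5 * x^T A x (minimizer x* = 0) with
# additive Gaussian gradient noise. We estimate the tail probability
# P(||x_k - x*|| >= eps) by Monte Carlo over independent runs.
import numpy as np

rng = np.random.default_rng(0)

d = 2
A = np.diag([1.0, 4.0])        # strongly convex quadratic (illustrative choice)
noise_std = 0.5                 # std of the additive gradient noise (assumed)
num_iters = 200
num_runs = 5000
eps = 0.5                       # radius defining the "bad" event ||x_k|| >= eps

tail_counts = np.zeros(num_iters)
for _ in range(num_runs):
    x = np.ones(d)              # common starting point for all runs
    for k in range(num_iters):
        step = 1.0 / (k + 10)   # diminishing step size (illustrative schedule)
        grad = A @ x + noise_std * rng.standard_normal(d)
        x = x - step * grad
        if np.linalg.norm(x) >= eps:
            tail_counts[k] += 1

# Empirical estimates of P(||x_k - x*|| >= eps) at selected iterations.
tail_prob = tail_counts / num_runs
print(tail_prob[::20])
```

Plotting the negative log of these estimated tail probabilities against the iteration count gives an empirical view of the decay that a large deviations analysis quantifies, as opposed to tracking only the mean-square error E||x_k - x*||^2.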
