Title
Multi-Iteration Stochastic Optimizers
Authors
Abstract
We here introduce Multi-Iteration Stochastic Optimizers, a novel class of first-order stochastic optimizers in which the relative $L^2$ error is estimated and controlled using successive control variates along the path of iterations. By exploiting the correlation between iterates, the control variates may reduce the estimator's variance so that an accurate estimation of the mean gradient becomes computationally affordable. We name this mean-gradient estimator the Multi-Iteration stochastiC Estimator (MICE). In principle, given its non-intrusive nature, MICE can be flexibly coupled with any first-order stochastic optimizer. Our generic algorithm adaptively decides which iterates to keep in its index set. We present an error analysis of MICE and a convergence analysis of Multi-Iteration Stochastic Optimizers for different classes of problems, including some non-convex cases. In the smooth, strongly convex setting, we show that to approximate a minimizer with accuracy $tol$, SGD-MICE requires, on average, $O(tol^{-1})$ stochastic gradient evaluations, whereas SGD with adaptive batch sizes requires $O(tol^{-1} \log(tol^{-1}))$. Moreover, in a numerical evaluation, SGD-MICE reached $tol$ using fewer than 3% of the gradient evaluations required by adaptive-batch SGD. The MICE estimator provides a straightforward stopping criterion based on the gradient norm, which we validate in consistency tests. To assess the efficiency of MICE, we present several examples using SGD-MICE and Adam-MICE, including a stochastic adaptation of the Rosenbrock function and logistic regression training on various datasets. Compared to SGD, SAG, SAGA, SVRG, and SARAH, the Multi-Iteration Stochastic Optimizers reduced the gradient sampling cost in all cases tested, without any per-example parameter tuning, while also being competitive in runtime in some cases.
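
For intuition, the following is a minimal Python sketch, written for this summary rather than taken from the paper, of the telescoping control-variate idea the abstract describes: the mean gradient at the current iterate is estimated as a plain Monte Carlo estimate at an earlier iterate plus low-variance corrections formed from gradient differences between successive iterates. The toy objective, sample sizes, and step size are illustrative assumptions, and the sketch omits MICE's adaptive index-set management and relative-error control.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy objective (an assumption for illustration, not from the paper):
    #   f(x) = E[ 0.5 * theta * ||x||^2 ] - b.x,  theta ~ N(1, 0.5^2),
    # with stochastic gradient g(x, theta) = theta * x - b and exact
    # minimizer x* = b.
    b = np.array([1.0, -2.0])

    def stoch_grads(x, thetas):
        """Stochastic gradients g(x, theta) for a batch of theta samples."""
        return thetas[:, None] * x - b

    def mice_gradient(iterates, sample_sizes):
        """Telescoping control-variate estimate of the mean gradient at the
        last iterate: a base estimate at the first kept iterate plus mean
        gradient differences between successive iterates."""
        thetas = rng.normal(1.0, 0.5, sample_sizes[0])
        est = stoch_grads(iterates[0], thetas).mean(axis=0)
        for ell in range(1, len(iterates)):
            thetas = rng.normal(1.0, 0.5, sample_sizes[ell])
            # The same theta samples appear in both terms, so the difference
            # has variance ~ ||x_ell - x_{ell-1}||^2, which vanishes as the
            # optimization path converges; few samples per level suffice.
            diff = (stoch_grads(iterates[ell], thetas)
                    - stoch_grads(iterates[ell - 1], thetas))
            est += diff.mean(axis=0)
        return est

    # SGD driven by the multi-iteration estimator (fixed, illustrative
    # sample sizes; the actual algorithm chooses them adaptively).
    x, path, lr = np.zeros(2), [], 0.5
    for k in range(30):
        path.append(x.copy())
        g = mice_gradient(path, [64] + [8] * (len(path) - 1))
        x = x - lr * g
    print("approximate minimizer:", x, " exact minimizer:", b)

In this toy run the estimate converges to $x^* = b$ while each new iterate adds only a small correction batch, illustrating why exploiting correlation between iterates can make an accurate mean-gradient estimate affordable.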