Paper Title

Trajectory of Mini-Batch Momentum: Batch Size Saturation and Convergence in High Dimensions

Authors

Kiwon Lee, Andrew N. Cheng, Courtney Paquette, Elliot Paquette

Abstract

We analyze the dynamics of large batch stochastic gradient descent with momentum (SGD+M) on the least squares problem when both the number of samples and dimensions are large. In this setting, we show that the dynamics of SGD+M converge to a deterministic discrete Volterra equation as dimension increases, which we analyze. We identify a stability measurement, the implicit conditioning ratio (ICR), which regulates the ability of SGD+M to accelerate the algorithm. When the batch size exceeds this ICR, SGD+M converges linearly at a rate of $\mathcal{O}(1/\sqrt{\kappa})$, matching optimal full-batch momentum (in particular performing as well as full-batch but with a fraction of the size). For batch sizes smaller than the ICR, in contrast, SGD+M has rates that scale like a multiple of the single batch SGD rate. We give explicit choices for the learning rate and momentum parameter in terms of the Hessian spectra that achieve this performance.
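
To make the setting concrete, below is a minimal NumPy sketch of mini-batch SGD+M on a least squares problem. The problem sizes, batch size, and the Polyak-style heavy-ball parameters (set from the extreme eigenvalues of the Hessian $A^\top A$, with an extra safety factor on the step size) are illustrative assumptions, not the paper's batch-size-dependent tuning.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, batch = 2000, 500, 800  # samples, dimension, mini-batch size (illustrative)

# Consistent least squares problem: b lies in the range of A, so the
# mini-batch gradient noise vanishes at the minimizer.
A = rng.standard_normal((n, d)) / np.sqrt(n)
b = A @ rng.standard_normal(d)

# The Hessian of f(x) = 0.5 * ||Ax - b||^2 is A^T A; its extreme eigenvalues
# give the classical full-batch heavy-ball tuning used below (with a safety
# factor on the step size; the paper's batch-size-dependent choices are what
# actually guarantee the accelerated rate).
eigs = np.linalg.eigvalsh(A.T @ A)
lam_min, lam_max = eigs[0], eigs[-1]
kappa = lam_max / lam_min
gamma = 0.5 * 4.0 / (np.sqrt(lam_max) + np.sqrt(lam_min)) ** 2  # learning rate
delta = ((np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)) ** 2  # momentum

x = x_prev = np.zeros(d)
for _ in range(3000):
    idx = rng.choice(n, size=batch, replace=False)          # draw a mini-batch
    grad = (n / batch) * A[idx].T @ (A[idx] @ x - b[idx])   # unbiased gradient estimate
    x, x_prev = x - gamma * grad + delta * (x - x_prev), x  # SGD+M update

print(f"kappa = {kappa:.1f}, final loss = {0.5 * np.linalg.norm(A @ x - b) ** 2:.3e}")
```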
