Paper Title

Non-convergence of stochastic gradient descent in the training of deep neural networks

Authors

Patrick Cheridito, Arnulf Jentzen, Florian Rossmannek

Abstract

Deep neural networks have successfully been trained in various application areas with stochastic gradient descent. However, there exists no rigorous mathematical explanation of why this works so well. The training of neural networks with stochastic gradient descent has four different discretization parameters: (i) the network architecture; (ii) the amount of training data; (iii) the number of gradient steps; and (iv) the number of randomly initialized gradient trajectories. While it can be shown that the approximation error converges to zero if all four parameters are sent to infinity in the right order, we demonstrate in this paper that stochastic gradient descent fails to converge for ReLU networks if their depth is much larger than their width and the number of random initializations does not increase to infinity fast enough.
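The failure mode described in the abstract can be illustrated empirically. The following is a minimal sketch, not the paper's construction or proof: it assumes PyTorch, an arbitrary toy 1-D regression target sin(3x), and hypothetical choices of width 3, depth 50, learning rate, and batch size. With depth far exceeding width and only a single random initialization, a default-initialized ReLU network of this shape typically collapses to a near-constant function, and the SGD loss stops decreasing.

```python
# Illustrative sketch (assumptions: PyTorch, toy sin(3x) regression, width 3,
# depth 50, lr 1e-2, batch size 32 -- none of these come from the paper).
# Train a very deep, narrow ReLU network with plain SGD from a single random
# initialization and watch the output spread: a spread near zero means the
# network has degenerated to an (almost) constant function.
import torch
import torch.nn as nn

torch.manual_seed(0)  # a single random initialization, as in the non-convergence regime

width, depth = 3, 50  # depth much larger than width
layers = [nn.Linear(1, width), nn.ReLU()]
for _ in range(depth - 2):
    layers += [nn.Linear(width, width), nn.ReLU()]
layers += [nn.Linear(width, 1)]
net = nn.Sequential(*layers)

# Toy data: approximate f(x) = sin(3x) on [-1, 1].
x = torch.linspace(-1.0, 1.0, 256).unsqueeze(1)
y = torch.sin(3.0 * x)

opt = torch.optim.SGD(net.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for step in range(2001):
    idx = torch.randint(0, x.shape[0], (32,))  # random mini-batch (stochastic gradient)
    opt.zero_grad()
    loss = loss_fn(net(x[idx]), y[idx])
    loss.backward()
    opt.step()
    if step % 500 == 0:
        with torch.no_grad():
            pred = net(x)
            spread = (pred.max() - pred.min()).item()
            full_loss = loss_fn(pred, y).item()
        print(f"step {step:5d}  full loss {full_loss:.4f}  output spread {spread:.4e}")
```

In runs of this sketch the loss usually plateaus at the variance of the target rather than decreasing to zero; averaging over many independent random initializations (parameter (iv) in the abstract) is what would be needed to escape this behavior.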
