Paper Title

Large-time asymptotics in deep learning

Authors

Carlos Esteve, Borjan Geshkovski, Dario Pighin, Enrique Zuazua

Abstract

We consider the neural ODE perspective of supervised learning and study the impact of the final time $T$ (which may indicate the depth of a corresponding ResNet) in training. For the classical $L^2$--regularized empirical risk minimization problem, whenever the neural ODE dynamics are homogeneous with respect to the parameters, we show that the training error is at most of the order $\mathcal{O}\left(\frac{1}{T}\right)$. Furthermore, if the loss inducing the empirical risk attains its minimum, the optimal parameters converge to minimal $L^2$--norm parameters which interpolate the dataset. By a natural scaling between $T$ and the regularization hyperparameter $\lambda$ we obtain the same results when $\lambda \searrow 0$ and $T$ is fixed. This allows us to stipulate generalization properties in the overparametrized regime, now seen from the large depth, neural ODE perspective. To enhance the polynomial decay, inspired by turnpike theory in optimal control, we propose a learning problem with an additional integral regularization term of the neural ODE trajectory over $[0,T]$. In the setting of $\ell^p$--distance losses, we prove that both the training error and the optimal parameters are at most of the order $\mathcal{O}\left(e^{-\mu t}\right)$ in any $t\in[0,T]$. The aforementioned stability estimates are also shown for continuous space-time neural networks, taking the form of nonlinear integro-differential equations. By using a time-dependent moving grid for discretizing the spatial variable, we demonstrate that these equations provide a framework for addressing ResNets with variable widths.
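
To fix ideas, here is a schematic form of the two training problems described in the abstract; the notation (the dynamics $f$, the output map $P$, the loss) is illustrative and not taken verbatim from the paper. Each data point is propagated by the neural ODE $\dot{x}_i(t) = f(\theta(t), x_i(t))$ on $[0,T]$, with $x_i(0)$ equal to the $i$-th input. The classical problem penalizes the empirical risk at the final time only,

$$\min_{\theta}\ \frac{1}{N}\sum_{i=1}^{N}\mathrm{loss}\bigl(P\,x_i(T),\,y_i\bigr) \;+\; \lambda\int_0^T \|\theta(t)\|^2\,\mathrm{d}t,$$

and, when $f$ is homogeneous in $\theta$, the abstract states that the optimal training error is $\mathcal{O}\left(\frac{1}{T}\right)$. The turnpike-inspired variant additionally integrates the loss along the whole trajectory,

$$\min_{\theta}\ \int_0^T \frac{1}{N}\sum_{i=1}^{N}\bigl\|P\,x_i(t)-y_i\bigr\|_{\ell^p}\,\mathrm{d}t \;+\; \lambda\int_0^T \|\theta(t)\|^2\,\mathrm{d}t,$$

for which the training error and the optimal parameters are reported to decay like $\mathcal{O}\left(e^{-\mu t}\right)$ in $t$.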
