Paper Title

A Mean-field Analysis of Deep ResNet and Beyond: Towards Provable Optimization Via Overparameterization From Depth

Paper Authors

Yiping Lu, Chao Ma, Yulong Lu, Jianfeng Lu, Lexing Ying

Paper Abstract

Training deep neural networks with stochastic gradient descent (SGD) can often achieve zero training loss on real-world tasks although the optimization landscape is known to be highly non-convex. To understand the success of SGD for training deep neural networks, this work presents a mean-field analysis of deep residual networks, based on a line of works that interpret the continuum limit of the deep residual network as an ordinary differential equation when the network capacity tends to infinity. Specifically, we propose a new continuum limit of deep residual networks, which enjoys a good landscape in the sense that every local minimizer is global. This characterization enables us to derive the first global convergence result for multilayer neural networks in the mean-field regime. Furthermore, without assuming the convexity of the loss landscape, our proof relies on a zero-loss assumption at the global minimizer that can be achieved when the model shares a universal approximation property. Key to our result is the observation that a deep residual network resembles a shallow network ensemble, i.e. a two-layer network. We bound the difference between the shallow network and our ResNet model via the adjoint sensitivity method, which enables us to apply existing mean-field analyses of two-layer networks to deep networks. Furthermore, we propose several novel training schemes based on the new continuous model, including one training procedure that switches the order of the residual blocks and results in strong empirical performance on the benchmark datasets.
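The abstract's central object is the continuum limit in which a deep residual network is read as an ordinary differential equation, with each residual block acting as one step of a discretization, plus a training scheme that switches the order of the residual blocks. The following is a minimal PyTorch sketch of that reading, not the authors' code: the class names (ResidualBlock, EulerResNet), the 1/depth step size, the block widths, and the random permutation used to "switch" block order are all illustrative assumptions; the paper's exact construction and training procedure may differ.

```python
# Minimal sketch (assumptions, not the paper's implementation): a ResNet whose
# residual blocks share one width can be read as a forward-Euler discretization of
#     dx/dt = f(x(t), theta(t)),  t in [0, 1],
# with step size 1/depth; larger depth approaches the continuum (ODE) limit.
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """One residual block: x <- x + (1/depth) * f(x)."""

    def __init__(self, width: int, depth: int):
        super().__init__()
        self.step = 1.0 / depth  # Euler step size shrinks as depth grows
        self.f = nn.Sequential(
            nn.Linear(width, width), nn.ReLU(), nn.Linear(width, width)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.step * self.f(x)


class EulerResNet(nn.Module):
    """Stack of residual blocks; letting depth grow mimics the ODE limit."""

    def __init__(self, in_dim: int, width: int, depth: int, out_dim: int):
        super().__init__()
        self.embed = nn.Linear(in_dim, width)
        self.blocks = nn.ModuleList(ResidualBlock(width, depth) for _ in range(depth))
        self.head = nn.Linear(width, out_dim)

    def forward(self, x: torch.Tensor, permute_blocks: bool = False) -> torch.Tensor:
        x = self.embed(x)
        blocks = list(self.blocks)
        if permute_blocks:
            # Toy stand-in for the "switch the order of the residual blocks"
            # training scheme mentioned in the abstract (hypothetical variant).
            blocks = [blocks[i] for i in torch.randperm(len(blocks)).tolist()]
        for block in blocks:
            x = block(x)
        return self.head(x)


if __name__ == "__main__":
    model = EulerResNet(in_dim=32, width=64, depth=20, out_dim=10)
    x = torch.randn(8, 32)
    print(model(x).shape)                       # torch.Size([8, 10])
    print(model(x, permute_blocks=True).shape)  # same shape with shuffled block order
```

Because each block contributes only an O(1/depth) update, shuffling or resampling block order perturbs the trajectory mildly, which is one intuition behind the abstract's remark that a deep residual network resembles a shallow network ensemble.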
