Paper Title
Unique Properties of Flat Minima in Deep Networks
Paper Authors
Paper Abstract
It is well known that (stochastic) gradient descent has an implicit bias towards flat minima. In deep neural network training, this mechanism serves to screen out minima. However, the precise effect that this has on the trained network is not yet fully understood. In this paper, we characterize the flat minima of linear neural networks trained with a quadratic loss. First, we show that linear ResNets with zero initialization necessarily converge to the flattest of all minima. We then prove that these minima correspond to nearly balanced networks, whereby the gain from the input to any intermediate representation does not change drastically from one layer to the next. Finally, we show that consecutive layers in flat minima solutions are coupled. That is, one of the left singular vectors of each weight matrix equals one of the right singular vectors of the next matrix. This forms a distinct path from input to output which, as we show, is dedicated to the signal that experiences the largest end-to-end gain. Experiments indicate that these properties are characteristic of both linear and nonlinear models trained in practice.
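The two structural claims in the abstract (near-balance of consecutive layers and coupling of their singular vectors) are concrete enough to check numerically. Below is a minimal sketch, not taken from the paper, assuming NumPy and a hypothetical list of trained weight matrices; the function name and tolerance handling are illustrative only.

```python
# Minimal sketch (hypothetical, not the authors' code): numerically inspect the
# two flat-minimum properties described in the abstract for a deep linear
# network x -> W_L @ ... @ W_1 @ x, given its weight matrices.
import numpy as np

def inspect_flat_minimum_properties(weights):
    """weights: list [W_1, ..., W_L], with W_{i+1} applied after W_i."""
    for i, (W_cur, W_next) in enumerate(zip(weights[:-1], weights[1:]), start=1):
        # Near-balance: the Gram matrices of consecutive layers should be close,
        # so the gain up to the intermediate representation changes little
        # from one layer to the next.
        balance_gap = np.linalg.norm(W_next.T @ W_next - W_cur @ W_cur.T)

        # Coupling: some left singular vector of W_i should equal (up to sign)
        # some right singular vector of W_{i+1}.
        U_cur, _, _ = np.linalg.svd(W_cur)        # columns: left sing. vectors of W_i
        _, _, Vt_next = np.linalg.svd(W_next)     # rows: right sing. vectors of W_{i+1}
        alignment = np.max(np.abs(Vt_next @ U_cur))  # 1.0 means a perfect match

        print(f"layers {i}-{i+1}: balance gap {balance_gap:.3e}, "
              f"best singular-vector alignment {alignment:.3f}")
```

For weights taken from a trained model (e.g., a NumPy export of the layer matrices), a small balance gap and an alignment close to 1 for every consecutive pair would be consistent with the balance and coupling properties stated above.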