Paper Title
Beyond the Quadratic Approximation: the Multiscale Structure of Neural Network Loss Landscapes
Paper Authors
Paper Abstract
A quadratic approximation of neural network loss landscapes has been extensively used to study the optimization process of these networks. However, it usually holds only in a very small neighborhood of the minimum, and therefore cannot explain many phenomena observed during the optimization process. In this work, we study the structure of neural network loss functions and its implication for optimization in a region beyond the reach of a good quadratic approximation. Numerically, we observe that neural network loss functions possess a multiscale structure, manifested in two ways: (1) in a neighborhood of minima, the loss mixes a continuum of scales and grows subquadratically, and (2) in a larger region, the loss shows several separate scales clearly. Using the subquadratic growth, we are able to explain the Edge of Stability phenomenon [5] observed for the gradient descent (GD) method. Using the separate scales, we explain the working mechanism of learning rate decay through simple examples. Finally, we study the origin of the multiscale structure and propose that the non-convexity of the models and the non-uniformity of the training data are among the causes. By constructing a two-layer neural network problem, we show that training data with different magnitudes give rise to different scales in the loss function, producing subquadratic growth and multiple separate scales.
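As a minimal illustration of the subquadratic-growth argument (not the paper's two-layer construction), the sketch below runs gradient descent on the toy one-dimensional loss L(x) = |x|^p with an assumed exponent 1 < p < 2; the exponent, learning rate, initialization, and iteration counts are illustrative choices. Because the curvature p(p-1)|x|^(p-2) diverges near the minimum, GD with a fixed step size eta cannot settle at x = 0 and instead oscillates at an amplitude where the local sharpness is of order 2/eta, loosely echoing the Edge of Stability; decaying the learning rate then lets the iterate reach a finer scale, echoing the mechanism of learning rate decay.

```python
import numpy as np

# Toy subquadratic loss L(x) = |x|**P with an assumed exponent 1 < P < 2.
P = 1.5  # illustrative exponent

def grad(x):
    # dL/dx = P * sign(x) * |x|**(P-1)
    return P * np.sign(x) * np.abs(x) ** (P - 1)

def sharpness(x):
    # d^2L/dx^2 = P * (P-1) * |x|**(P-2); diverges as x -> 0 when P < 2
    return P * (P - 1) * np.abs(x) ** (P - 2)

x, eta = 1.0, 0.05  # illustrative initialization and learning rate

# Fixed learning rate: GD first decreases the loss, then settles into an
# oscillation whose amplitude is set by eta rather than converging to x = 0.
for _ in range(200):
    x = x - eta * grad(x)
print(f"|x| with fixed eta     : {abs(x):.2e}")
print(f"local sharpness        : {sharpness(x):.1f}  (2/eta = {2 / eta:.1f})")

# Decayed learning rate: a smaller step size lets GD reach a finer scale.
for _ in range(200):
    x = x - (eta / 10) * grad(x)
print(f"|x| after decaying eta : {abs(x):.2e}")
```

In this 1-D toy setting the final sharpness stays on the same order as 2/eta, and shrinking eta by a factor of ten moves the iterate much closer to the minimum; this is only meant to mirror, in the simplest possible form, the two abstract claims about subquadratic growth and learning rate decay.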