Paper Title
Demystifying the Global Convergence Puzzle of Learning Over-parameterized ReLU Nets in Very High Dimensions
Paper Authors
Paper Abstract
This theoretical paper is devoted to developing a rigorous theory for demystifying the global convergence phenomenon in a challenging scenario: learning over-parameterized Rectified Linear Unit (ReLU) nets on very high-dimensional datasets under very mild assumptions. A major ingredient of our analysis is a fine-grained analysis of random activation matrices. The essential virtue of dissecting activation matrices is that it bridges the dynamics of optimization and the angular distribution in high-dimensional data space. This angle-based detailed analysis leads to asymptotic characterizations of the gradient norm and the directional curvature of the objective function at each gradient descent iteration, revealing that the empirical loss function enjoys nice geometric properties in the over-parameterized setting. Along the way, we significantly improve existing theoretical bounds on both the over-parameterization condition and the learning rate, under very mild assumptions, for learning very high-dimensional data. Moreover, we uncover the role of the geometric and spectral properties of the input data in determining the desired over-parameterization size and the global convergence rate. All these clues allow us to discover a novel geometric picture of nonconvex optimization in deep learning: angular distribution in high-dimensional data space $\mapsto$ spectra of over-parameterized activation matrices $\mapsto$ favorable geometric properties of the empirical loss landscape $\mapsto$ global convergence phenomenon. Furthermore, our theoretical results imply that gradient-based nonconvex optimization algorithms enjoy much stronger statistical guarantees, under a much milder over-parameterization condition than existing theory states, when learning very high-dimensional data, a regime that has rarely been explored so far.
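The geometric picture sketched in the abstract can be illustrated with a toy experiment. The following minimal NumPy sketch (not the paper's actual construction; all sizes, the initialization scheme, and the step size are hypothetical choices) trains an over-parameterized one-hidden-layer ReLU net with full-batch gradient descent on synthetic high-dimensional data, and also inspects the smallest eigenvalue of a finite-width ReLU Gram (activation) matrix of the kind used in related over-parameterization analyses, which is one way the data's angular distribution enters the convergence behavior.

```python
# Illustrative sketch only: over-parameterized one-hidden-layer ReLU net trained
# with full-batch gradient descent on synthetic high-dimensional data.
# Sizes, initialization, and learning rate are hypothetical, not from the paper.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 50, 1000, 4000          # samples, input dimension (d >> n), hidden width

# Synthetic data on the unit sphere (so inner products encode angles) and arbitrary labels.
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.standard_normal(n)

# Random hidden-layer weights; fixed random signs on the output layer, scaled by 1/sqrt(m)
# (a common simplification in over-parameterization analyses).
W = rng.standard_normal((m, d))
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)

# Finite-width ReLU Gram matrix H_ij = (1/m) * (x_i . x_j) * sum_r 1{w_r.x_i>=0} 1{w_r.x_j>=0}.
# Its smallest eigenvalue reflects the angular spread of the data and, in analyses of this
# kind, controls the global convergence rate of gradient descent.
Z = (X @ W.T >= 0).astype(float)              # n x m activation-pattern matrix
H = (X @ X.T) * (Z @ Z.T) / m                 # n x n Gram matrix
print("lambda_min(H) =", np.linalg.eigvalsh(H)[0])

def loss_and_grad(W):
    pre = X @ W.T                             # n x m pre-activations
    act = np.maximum(pre, 0.0)                # ReLU
    err = act @ a - y                         # residuals of the network outputs
    loss = 0.5 * np.sum(err ** 2)
    # Gradient of the squared loss w.r.t. the hidden-layer weights (m x d).
    grad = ((err[:, None] * (pre > 0)) * a).T @ X
    return loss, grad

lr = 1.0                                      # hypothetical step size
for t in range(101):
    loss, grad = loss_and_grad(W)
    W -= lr * grad
    if t % 20 == 0:
        print(f"iter {t:3d}  loss {loss:.3e}")
```

Running this, the training loss decays geometrically toward zero even though the objective is nonconvex, matching the global convergence phenomenon the abstract describes; shrinking the hidden width m or concentrating the data points in a narrow angular cone makes `lambda_min(H)` smaller and the decay visibly slower.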