Paper Title

High-dimensional Asymptotics of Feature Learning: How One Gradient Step Improves the Representation

Paper Authors

Jimmy Ba, Murat A. Erdogdu, Taiji Suzuki, Zhichao Wang, Denny Wu, Greg Yang

Paper Abstract

We study the first gradient descent step on the first-layer parameters $\boldsymbol{W}$ in a two-layer neural network: $f(\boldsymbol{x}) = \frac{1}{\sqrt{N}}\boldsymbol{a}^\top\sigma(\boldsymbol{W}^\top\boldsymbol{x})$, where $\boldsymbol{W}\in\mathbb{R}^{d\times N}, \boldsymbol{a}\in\mathbb{R}^{N}$ are randomly initialized, and the training objective is the empirical MSE loss: $\frac{1}{n}\sum_{i=1}^n (f(\boldsymbol{x}_i)-y_i)^2$. In the proportional asymptotic limit where $n,d,N\to\infty$ at the same rate, and an idealized student-teacher setting, we show that the first gradient update contains a rank-1 "spike", which results in an alignment between the first-layer weights and the linear component of the teacher model $f^*$. To characterize the impact of this alignment, we compute the prediction risk of ridge regression on the conjugate kernel after one gradient step on $\boldsymbol{W}$ with learning rate $\eta$, when $f^*$ is a single-index model. We consider two scalings of the first step learning rate $\eta$. For small $\eta$, we establish a Gaussian equivalence property for the trained feature map, and prove that the learned kernel improves upon the initial random features model, but cannot defeat the best linear model on the input. Whereas for sufficiently large $\eta$, we prove that for certain $f^*$, the same ridge estimator on trained features can go beyond this "linear regime" and outperform a wide range of random features and rotationally invariant kernels. Our results demonstrate that even one gradient step can lead to a considerable advantage over random features, and highlight the role of learning rate scaling in the initial phase of training.
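The setting described in the abstract can be simulated directly. Below is a minimal numerical sketch (not the authors' code): it draws a single-index teacher, takes one full-batch gradient step on the first-layer weights $\boldsymbol{W}$ under the empirical MSE loss, checks the alignment between the rank-1 "spike" of that gradient and the teacher direction, and then fits ridge regression on the trained first-layer features (the conjugate kernel after one step). The dimensions, the $\tanh$ activation and link function, and the values of `eta` and `lam` are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

# Minimal sketch of the abstract's setup: a two-layer network
# f(x) = a^T sigma(W^T x) / sqrt(N), one full-batch gradient step on W under
# the empirical MSE loss, then ridge regression on the trained features.
# All sizes, activations, eta, and lam below are illustrative assumptions.

rng = np.random.default_rng(0)
n, d, N = 800, 400, 600        # samples, input dimension, width (proportional regime)
eta, lam = 1.0, 1e-2           # first-step learning rate and ridge penalty (assumed)

# Single-index teacher f*(x) = sigma_*(<beta, x>) with a unit direction beta.
beta = rng.standard_normal(d)
beta /= np.linalg.norm(beta)
sigma_star = np.tanh           # example link function (assumption)

X = rng.standard_normal((n, d))          # isotropic Gaussian inputs
y = sigma_star(X @ beta)                 # teacher labels

# Random initialization; the second layer a is held fixed, W receives one step.
W = rng.standard_normal((d, N)) / np.sqrt(d)
a = rng.standard_normal(N)
sigma = np.tanh                          # student activation (assumption)

# One full-batch gradient step on W for L(W) = (1/n) sum_i (f(x_i) - y_i)^2.
Phi = sigma(X @ W)                       # first-layer activations, shape (n, N)
resid = Phi @ a / np.sqrt(N) - y         # f(x_i) - y_i, shape (n,)
dPhi = 1.0 - Phi ** 2                    # sigma'(X W) for sigma = tanh
grad_W = (2.0 / (n * np.sqrt(N))) * X.T @ (resid[:, None] * dPhi * a[None, :])
W1 = W - eta * grad_W                    # first-layer weights after one step

# The first gradient is dominated by a rank-1 "spike" whose top left singular
# vector should align with the teacher direction beta.
u = np.linalg.svd(grad_W, full_matrices=False)[0][:, 0]
print("alignment |<u, beta>|:", abs(u @ beta))

# Ridge regression on the trained feature map (conjugate kernel after one step).
Phi1 = sigma(X @ W1)
coef = np.linalg.solve(Phi1.T @ Phi1 / n + lam * np.eye(N), Phi1.T @ y / n)
print("in-sample MSE of ridge on trained features:",
      np.mean((Phi1 @ coef - y) ** 2))
```

To compare against the "linear regime" discussed in the abstract, the same ridge fit can be repeated with the untrained features `sigma(X @ W)` or with the raw inputs `X`; the paper's results concern the test risk of these estimators in the proportional limit, which this finite-size sketch only illustrates qualitatively.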
