Paper Title

How Does Adaptive Optimization Impact Local Neural Network Geometry?

Authors

Kaiqi Jiang, Dhruv Malik, Yuanzhi Li

Abstract

Adaptive optimization methods are well known to achieve superior convergence relative to vanilla gradient methods. The traditional viewpoint in optimization, particularly in convex optimization, explains this improved performance by arguing that, unlike vanilla gradient schemes, adaptive algorithms mimic the behavior of a second-order method by adapting to the global geometry of the loss function. We argue that in the context of neural network optimization, this traditional viewpoint is insufficient. Instead, we advocate for a local trajectory analysis. For iterate trajectories produced by running a generic optimization algorithm OPT, we introduce $R^{\text{OPT}}_{\text{med}}$, a statistic that is analogous to the condition number of the loss Hessian evaluated at the iterates. Through extensive experiments, we show that adaptive methods such as Adam bias the trajectories towards regions where $R^{\text{Adam}}_{\text{med}}$ is small, where one might expect faster convergence. By contrast, vanilla gradient methods like SGD bias the trajectories towards regions where $R^{\text{SGD}}_{\text{med}}$ is comparatively large. We complement these empirical observations with a theoretical result that provably demonstrates this phenomenon in the simplified setting of a two-layer linear network. We view our findings as evidence for the need of a new explanation of the success of adaptive methods, one that is different than the conventional wisdom.
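The precise definition of $R^{\text{OPT}}_{\text{med}}$ is given in the paper; the sketch below only illustrates the general idea of tracking a condition-number-like statistic of the loss Hessian along an optimizer's iterate trajectory. Everything in it is an assumption made for illustration rather than the paper's construction: the proxy statistic (largest Hessian eigenvalue divided by the median absolute eigenvalue), the toy scalar two-layer linear network, the learning rates, and the simplified Adam-style update without momentum or bias correction.

```python
import numpy as np

# Illustrative proxy (not the paper's exact definition of R^OPT_med):
# a condition-number-like statistic of the loss Hessian, recorded along
# an optimizer's trajectory. Toy setting: a scalar two-layer linear
# network f(x) = w2 * w1 * x with squared loss on a single sample (x, y).

x, y = 1.0, 2.0

def grad(w):
    w1, w2 = w
    r = w2 * w1 * x - y                      # residual
    return np.array([r * w2 * x, r * w1 * x])

def hessian(w):
    w1, w2 = w
    r = w2 * w1 * x - y
    # Analytic Hessian of L(w1, w2) = 0.5 * (w2 * w1 * x - y)^2.
    h11 = (w2 * x) ** 2
    h22 = (w1 * x) ** 2
    h12 = w1 * w2 * x ** 2 + r * x
    return np.array([[h11, h12], [h12, h22]])

def r_stat(w, eps=1e-12):
    """Proxy statistic: largest / median absolute Hessian eigenvalue."""
    eig = np.abs(np.linalg.eigvalsh(hessian(w)))
    return eig.max() / max(np.median(eig), eps)

def run_gd(w0, lr=0.05, steps=200):
    """Plain gradient descent; returns the R proxy at each iterate."""
    w, rs = np.array(w0, dtype=float), []
    for _ in range(steps):
        rs.append(r_stat(w))
        w -= lr * grad(w)
    return rs

def run_adam_like(w0, lr=0.05, beta2=0.999, steps=200, eps=1e-8):
    """Diagonally preconditioned (Adam-style, momentum-free) update."""
    w, v, rs = np.array(w0, dtype=float), np.zeros(2), []
    for _ in range(steps):
        rs.append(r_stat(w))
        g = grad(w)
        v = beta2 * v + (1 - beta2) * g ** 2
        w -= lr * g / (np.sqrt(v) + eps)
    return rs

if __name__ == "__main__":
    w0 = [0.1, 1.5]  # deliberately imbalanced layers
    print("median R along GD trajectory:        ", np.median(run_gd(w0)))
    print("median R along Adam-like trajectory: ", np.median(run_adam_like(w0)))
```

Comparing the two printed medians mirrors, in miniature, the kind of trajectory-level comparison the abstract describes: the statistic is evaluated at the iterates actually visited by each optimizer, rather than from the global geometry of the loss.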
