Paper Title

The Golden Ratio of Learning and Momentum

Authors

Jaeger, Stefan

Abstract

Gradient descent has been a central training principle for artificial neural networks from the early beginnings to today's deep learning networks. The most common implementation is the backpropagation algorithm for training feed-forward neural networks in a supervised fashion. Backpropagation involves computing the gradient of a loss function, with respect to the weights of the network, to update the weights and thus minimize loss. Although the mean square error is often used as a loss function, the general stochastic gradient descent principle does not immediately connect with a specific loss function. Another drawback of backpropagation has been the search for optimal values of two important training parameters, learning rate and momentum weight, which are determined empirically in most systems. The learning rate specifies the step size towards a minimum of the loss function when following the gradient, while the momentum weight considers previous weight changes when updating current weights. Using both parameters in conjunction with each other is generally accepted as a means of improving training, although their specific values do not follow immediately from standard backpropagation theory. This paper proposes a new information-theoretical loss function motivated by neural signal processing in a synapse. The new loss function implies a specific learning rate and momentum weight, leading to empirical parameters often used in practice. The proposed framework also provides a more formal explanation of the momentum term and its smoothing effect on the training process. All results taken together show that loss, learning rate, and momentum are closely connected. To support these theoretical findings, experiments for handwritten digit recognition show the practical usefulness of the proposed loss function and training parameters.
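To make the roles of the two training parameters concrete, the following is a minimal sketch of gradient descent with a momentum term on a toy mean-square-error problem. The loss, the data, and the values learning_rate = 0.1 and momentum = 0.9 are illustrative assumptions (common empirical choices), not the specific golden-ratio-related values derived in the paper.

```python
# Minimal sketch: gradient descent with momentum on a toy least-squares problem.
# The learning rate sets the step size along the negative gradient; the momentum
# weight blends in the previous weight change, smoothing the update trajectory.
import numpy as np

rng = np.random.default_rng(0)

# Toy data for a linear model with mean-square-error loss (illustrative assumption).
X = rng.normal(size=(100, 5))
w_true = rng.normal(size=5)
y = X @ w_true

def gradient(w):
    # Gradient of the mean-square-error loss with respect to the weights.
    return X.T @ (X @ w - y) / len(y)

learning_rate = 0.1   # step size towards a minimum (empirical choice)
momentum = 0.9        # weight given to the previous weight change (empirical choice)

w = np.zeros(5)
velocity = np.zeros(5)
for step in range(200):
    # The new change combines the previous change with the current gradient step.
    velocity = momentum * velocity - learning_rate * gradient(w)
    w += velocity

print("final loss:", 0.5 * np.mean((X @ w - y) ** 2))
```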
