Paper Title

On the Implicit Bias of Gradient Descent for Temporal Extrapolation

Authors

Edo Cohen-Karlik, Avichai Ben David, Nadav Cohen, Amir Globerson

Abstract

When using recurrent neural networks (RNNs) it is common practice to apply trained models to sequences longer than those seen in training. This "extrapolating" usage deviates from the traditional statistical learning setup where guarantees are provided under the assumption that train and test distributions are identical. Here we set out to understand when RNNs can extrapolate, focusing on a simple case where the data generating distribution is memoryless. We first show that even with infinite training data, there exist RNN models that interpolate perfectly (i.e., they fit the training data) yet extrapolate poorly to longer sequences. We then show that if gradient descent is used for training, learning will converge to perfect extrapolation under certain assumptions on initialization. Our results complement recent studies on the implicit bias of gradient descent, showing that it plays a key role in extrapolation when learning temporal prediction models.
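The first claim (interpolating models that extrapolate poorly exist) can be illustrated with a small hand-built example. This is a hypothetical sketch, not the paper's construction: a 2-D linear RNN whose recurrent matrix is a 90° rotation reproduces the memoryless target y = x_T (the last input) exactly on every length-2 sequence, yet its prediction on length-3 sequences is x_3 - x_1, which is generally wrong.

```python
import numpy as np

# Linear RNN: h_t = R h_{t-1} + u * x_t, prediction y = v . h_T.
# Memoryless target task: y = x_T, the last input of the sequence.
R = np.array([[0.0, -1.0], [1.0, 0.0]])  # rotation by 90 degrees
u = np.array([1.0, 0.0])
v = np.array([1.0, 0.0])

def rnn_predict(xs):
    h = np.zeros(2)
    for x in xs:
        h = R @ h + u * x
    return v @ h

rng = np.random.default_rng(0)

# On length-2 sequences (the "training" length) the model interpolates:
# output = x2 + cos(90°) * x1 = x2, matching the target for every input.
for _ in range(5):
    xs = rng.standard_normal(2)
    assert np.isclose(rnn_predict(xs), xs[-1])

# On length-3 sequences it extrapolates poorly:
# output = x3 + cos(90°) * x2 + cos(180°) * x1 = x3 - x1  !=  x3.
xs = rng.standard_normal(3)
print(rnn_predict(xs) - xs[-1])  # equals -xs[0], generally nonzero
```

The extrapolating solution the paper's gradient-descent result points to would instead have a zero recurrent matrix here, so that the prediction depends on the last input alone at every sequence length.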
