Paper Title

Learning Low Dimensional State Spaces with Overparameterized Recurrent Neural Nets

Paper Authors

Edo Cohen-Karlik, Itamar Menuhin-Gruman, Raja Giryes, Nadav Cohen, Amir Globerson

Paper Abstract

Overparameterization in deep learning typically refers to settings where a trained neural network (NN) has representational capacity to fit the training data in many ways, some of which generalize well, while others do not. In the case of Recurrent Neural Networks (RNNs), there exists an additional layer of overparameterization, in the sense that a model may exhibit many solutions that generalize well for sequence lengths seen in training, some of which extrapolate to longer sequences, while others do not. Numerous works have studied the tendency of Gradient Descent (GD) to fit overparameterized NNs with solutions that generalize well. On the other hand, its tendency to fit overparameterized RNNs with solutions that extrapolate has been discovered only recently and is far less understood. In this paper, we analyze the extrapolation properties of GD when applied to overparameterized linear RNNs. In contrast to recent arguments suggesting an implicit bias towards short-term memory, we provide theoretical evidence for learning low-dimensional state spaces, which can also model long-term memory. Our result relies on a dynamical characterization which shows that GD (with small step size and near-zero initialization) strives to maintain a certain form of balancedness, as well as on tools developed in the context of the moment problem from statistics (recovery of a probability distribution from its moments). Experiments corroborate our theory, demonstrating extrapolation via learning low-dimensional state spaces with both linear and non-linear RNNs.
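The setting described above can be illustrated with a small experiment: train an overparameterized linear RNN by plain GD, from near-zero initialization and with a small step size, on sequences generated by a low-dimensional linear teacher, then check whether the learned state-transition matrix is effectively low-rank and whether the model extrapolates to sequences longer than those seen in training. The sketch below is not the authors' code; the teacher system, dimensions, sequence lengths, initialization scale, and learning rate are illustrative assumptions.

```python
# Minimal sketch of the setting in the abstract (illustrative, not the authors' code).
# All hyperparameters below are assumptions and may need tuning for the loss to converge.
import torch

torch.manual_seed(0)

d_teacher, d_hidden, d_in = 2, 40, 1      # teacher state dim << student hidden dim
T_train, T_extrap = 10, 40                # training length vs. extrapolation length
n_samples, lr, n_steps = 512, 5e-2, 3000

def run_linear_rnn(A, B, C, x):
    """Linear RNN h_t = A h_{t-1} + B x_t; returns the final output y_T = C h_T."""
    h = torch.zeros(x.shape[0], A.shape[0])
    for t in range(x.shape[1]):
        h = h @ A.T + x[:, t] @ B.T
    return h @ C.T

# Low-dimensional teacher; eigenvalues close to 1 correspond to longer-term memory.
A_star = torch.tensor([[0.95, 0.0], [0.0, 0.80]])
B_star = torch.ones(d_teacher, d_in)
C_star = torch.ones(1, d_teacher)

def make_data(n, T):
    x = torch.randn(n, T, d_in)
    with torch.no_grad():
        y = run_linear_rnn(A_star, B_star, C_star, x)
    return x, y

x_train, y_train = make_data(n_samples, T_train)

# Overparameterized student, initialized near zero (scale is an illustrative choice).
scale = 1e-1
A = (scale * torch.randn(d_hidden, d_hidden)).requires_grad_()
B = (scale * torch.randn(d_hidden, d_in)).requires_grad_()
C = (scale * torch.randn(1, d_hidden)).requires_grad_()

opt = torch.optim.SGD([A, B, C], lr=lr)   # plain full-batch gradient descent
for step in range(n_steps):
    opt.zero_grad()
    loss = ((run_linear_rnn(A, B, C, x_train) - y_train) ** 2).mean()
    loss.backward()
    opt.step()

# Extrapolation test: sequences longer than any seen in training.
x_long, y_long = make_data(1024, T_extrap)
with torch.no_grad():
    extrap_mse = ((run_linear_rnn(A, B, C, x_long) - y_long) ** 2).mean()
    sing_vals = torch.linalg.svdvals(A)

print(f"final train loss   : {loss.item():.3e}")
print(f"extrapolation MSE  : {extrap_mse.item():.3e}")
print(f"top singular values of learned A: {sing_vals[:5].numpy().round(3)}")
# If GD is indeed biased towards low-dimensional state spaces, roughly d_teacher
# singular values of A should dominate and the extrapolation MSE should stay small.
```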
