Paper title
Deep neural networks with dependent weights: Gaussian Process mixture limit, heavy tails, sparsity and compressibility
Paper authors
Paper abstract
This article studies the infinite-width limit of deep feedforward neural networks whose weights are dependent, and modelled via a mixture of Gaussian distributions. Each hidden node of the network is assigned a nonnegative random variable that controls the variance of the outgoing weights of that node. We make minimal assumptions on these per-node random variables: they are iid and their sum, in each layer, converges to some finite random variable in the infinite-width limit. Under this model, we show that each layer of the infinite-width neural network can be characterised by two simple quantities: a non-negative scalar parameter and a Lévy measure on the positive reals. If the scalar parameters are strictly positive and the Lévy measures are trivial at all hidden layers, then one recovers the classical Gaussian process (GP) limit, obtained with iid Gaussian weights. More interestingly, if the Lévy measure of at least one layer is non-trivial, we obtain a mixture of Gaussian processes (MoGP) in the large-width limit. The behaviour of the neural network in this regime is very different from the GP regime. One obtains correlated outputs, with non-Gaussian distributions, possibly with heavy tails. Additionally, we show that, in this regime, the weights are compressible, and some nodes have asymptotically non-negligible contributions, therefore representing important hidden features. Many sparsity-promoting neural network models can be recast as special cases of our approach, and we discuss their infinite-width limits; we also present an asymptotic analysis of the pruning error. We illustrate some of the benefits of the MoGP regime over the GP regime in terms of representation learning and compressibility on simulated, MNIST and Fashion MNIST datasets.
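As a rough illustration of the weight model described in the abstract, the sketch below samples a feedforward network whose weights are conditionally Gaussian given per-node variance variables, i.e. a Gaussian scale mixture with dependence within each layer. The inverse-gamma choice for the per-node variables, the widths, and the ReLU activation are illustrative assumptions only, not specifics taken from the paper.

```python
import numpy as np

def sample_network(widths, rng, shape=0.5):
    """Sample a feedforward network with dependent, conditionally Gaussian weights.

    Every hidden node j is assigned a nonnegative variable lam[j]
    (inverse-gamma here, a purely hypothetical choice) that multiplies
    the variance of all of that node's outgoing weights, so the weights
    within a layer are dependent rather than iid.
    """
    params = []
    for l in range(len(widths) - 1):
        n_in, n_out = widths[l], widths[l + 1]
        if l == 0:
            # Input nodes are not hidden nodes: plain iid Gaussian weights.
            lam = np.ones(n_in)
        else:
            # Per-node variance multipliers (inverse-gamma draws).
            lam = 1.0 / rng.gamma(shape, 1.0, size=n_in)
        # Conditionally Gaussian weights: Var(W[j, k]) = lam[j] / n_in.
        W = rng.normal(size=(n_in, n_out)) * np.sqrt(lam / n_in)[:, None]
        b = rng.normal(size=n_out)
        params.append((W, b))
    return params

def forward(x, params):
    """ReLU hidden layers, linear output layer."""
    h = x
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)
    return h

rng = np.random.default_rng(0)
net = sample_network([2, 2000, 2000, 1], rng)
print(forward(np.array([[0.3, -0.7]]), net))
```

With heavy-tailed per-node variables such as these, a handful of nodes receive large variance multipliers and dominate the layer's output, which is the intuition behind the compressibility and "important hidden features" discussed in the abstract; with degenerate (constant) multipliers the construction reduces to the usual iid Gaussian weights of the GP limit.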