Paper Title

Poly-time universality and limitations of deep learning

Authors

Emmanuel Abbe, Colin Sandon

Abstract

The goal of this paper is to characterize function distributions that deep learning can or cannot learn in poly-time. A universality result is proved for SGD-based deep learning and a non-universality result is proved for GD-based deep learning; this also gives a separation between SGD-based deep learning and statistical query algorithms: (1) {\it Deep learning with SGD is efficiently universal.} Any function distribution that can be learned from samples in poly-time can also be learned by a poly-size neural net trained with SGD on a poly-time initialization with poly-steps, poly-rate and possibly poly-noise. Therefore deep learning provides a universal learning paradigm: it was known that the approximation and estimation errors could be controlled with poly-size neural nets, using ERM (which is NP-hard); this new result shows that the optimization error can also be controlled with SGD in poly-time. The picture changes for GD with large enough batches: (2) {\it Result (1) does not hold for GD:} Neural nets of poly-size trained with GD (full gradients or large enough batches) on any initialization with poly-steps, poly-range and at least poly-noise cannot learn any function distribution whose {\it cross-predictability} is super-polynomially small, where the cross-predictability gives a measure of ``average'' function correlation -- relations and distinctions to the statistical dimension are discussed. In particular, GD with these constraints can efficiently learn monomials of degree $k$ if and only if $k$ is constant. Thus (1) and (2) point to an interesting contrast: SGD is universal even with some poly-noise, while full GD or SQ algorithms are not (e.g., parities).
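To make the abstract's notion of ``average'' function correlation concrete, the following is a minimal sketch of a cross-predictability-style quantity together with the calculation behind the degree-$k$ monomial claim; the notation $\mathcal{P}_X$, $\mathcal{P}_F$ and the worked parity example are illustrative assumptions and may differ from the paper's exact definitions. For an input distribution $\mathcal{P}_X$ and a distribution $\mathcal{P}_F$ over functions $f:\mathcal{X}\to\{-1,+1\}$, one natural choice is the expected squared correlation of two independent draws from the function distribution,
$$\mathrm{Pred}(\mathcal{P}_X,\mathcal{P}_F) \;=\; \mathbb{E}_{F,F'\sim \mathcal{P}_F^{\otimes 2}}\Big[\big(\mathbb{E}_{X\sim \mathcal{P}_X}[F(X)\,F'(X)]\big)^{2}\Big].$$
As an illustration, take $\mathcal{P}_X$ uniform on $\{-1,+1\}^{n}$ and $\mathcal{P}_F$ uniform over the degree-$k$ monomials $\chi_S(x)=\prod_{i\in S}x_i$ with $|S|=k$. Distinct parities are orthogonal under the uniform input distribution, so the inner expectation equals $1$ when $S=S'$ and $0$ otherwise, and
$$\mathrm{Pred} \;=\; \Pr[S=S'] \;=\; \binom{n}{k}^{-1},$$
which for $k\ll n$ is of order $n^{-k}$: inverse-polynomial for constant $k$, and super-polynomially small as soon as $k=\omega(1)$, consistent with the statement that GD under the stated constraints learns degree-$k$ monomials efficiently if and only if $k$ is constant.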
