Paper Title

Memorizing without overfitting: Bias, variance, and interpolation in over-parameterized models

Paper Authors

Jason W. Rocks, Pankaj Mehta

Paper Abstract

The bias-variance trade-off is a central concept in supervised learning. In classical statistics, increasing the complexity of a model (e.g., number of parameters) reduces bias but also increases variance. Until recently, it was commonly believed that optimal performance is achieved at intermediate model complexities which strike a balance between bias and variance. Modern Deep Learning methods flout this dogma, achieving state-of-the-art performance using "over-parameterized models" where the number of fit parameters is large enough to perfectly fit the training data. As a result, understanding bias and variance in over-parameterized models has emerged as a fundamental problem in machine learning. Here, we use methods from statistical physics to derive analytic expressions for bias and variance in two minimal models of over-parameterization (linear regression and two-layer neural networks with nonlinear data distributions), allowing us to disentangle properties stemming from the model architecture and random sampling of data. In both models, increasing the number of fit parameters leads to a phase transition where the training error goes to zero and the test error diverges as a result of the variance (while the bias remains finite). Beyond this threshold, the test error of the two-layer neural network decreases due to a monotonic decrease in both the bias and variance in contrast with the classical bias-variance trade-off. We also show that in contrast with classical intuition, over-parameterized models can overfit even in the absence of noise and exhibit bias even if the student and teacher models match. We synthesize these results to construct a holistic understanding of generalization error and the bias-variance trade-off in over-parameterized models and relate our results to random matrix theory.
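For background, the bias and variance discussed above refer to the decomposition of the squared test error over independent draws of the training set D. The block below shows the standard convention only (with no noise term, since the target here is noiseless); the paper's exact splitting of terms may differ.

```latex
% Standard bias-variance decomposition of the squared test error at a point x,
% averaged over draws of the training set D (background convention only; the
% paper may split the terms differently).
\mathbb{E}_{\mathcal{D}}\!\left[\bigl(\hat{f}_{\mathcal{D}}(x) - y(x)\bigr)^{2}\right]
  = \underbrace{\bigl(\mathbb{E}_{\mathcal{D}}[\hat{f}_{\mathcal{D}}(x)] - y(x)\bigr)^{2}}_{\text{bias}^{2}}
  \; + \;
  \underbrace{\mathbb{E}_{\mathcal{D}}\!\left[\bigl(\hat{f}_{\mathcal{D}}(x) - \mathbb{E}_{\mathcal{D}}[\hat{f}_{\mathcal{D}}(x)]\bigr)^{2}\right]}_{\text{variance}}.
```

The interpolation threshold and the divergence of the variance can also be reproduced numerically. The sketch below is an illustration of the phenomenon, not the paper's analytic calculation: the student is minimum-norm least-squares regression on p frozen random ReLU features (a stand-in for a two-layer network with a random first layer), the teacher is linear and noiseless, and all names and parameter values (`fit_predict`, `n_draws`, etc.) are illustrative choices rather than the paper's.

```python
# Minimal numerical sketch (assumed setup, not the paper's derivation):
# estimate bias^2 and variance of random-ReLU-feature regression as the number
# of fit parameters p crosses the interpolation threshold p = n_train.
import numpy as np

rng = np.random.default_rng(0)

d = 20          # input dimension
n_train = 40    # training-set size; interpolation threshold sits at p = n_train
n_test = 500    # test-set size
n_draws = 200   # independent draws of the training data

beta_teacher = rng.standard_normal(d) / np.sqrt(d)   # noiseless linear teacher

def teacher(X):
    return X @ beta_teacher

X_test = rng.standard_normal((n_test, d))
y_test = teacher(X_test)

def fit_predict(p, X_tr, y_tr, W, X_te):
    """Least-squares fit on p random ReLU features; return test predictions."""
    Phi_tr = np.maximum(X_tr @ W, 0.0)   # (n_train, p) hidden activations
    Phi_te = np.maximum(X_te @ W, 0.0)   # (n_test, p)
    # lstsq returns the minimum-norm solution, which handles p > n_train.
    w_hat, *_ = np.linalg.lstsq(Phi_tr, y_tr, rcond=None)
    return Phi_te @ w_hat

for p in [5, 20, 35, 40, 45, 80, 200]:
    W = rng.standard_normal((d, p)) / np.sqrt(d)   # frozen random first layer
    preds = np.empty((n_draws, n_test))
    for k in range(n_draws):
        # Variance here comes only from resampling the training data,
        # mirroring the "random sampling of data" contribution in the abstract.
        X_tr = rng.standard_normal((n_train, d))
        preds[k] = fit_predict(p, X_tr, teacher(X_tr), W, X_test)
    mean_pred = preds.mean(axis=0)
    bias2 = np.mean((mean_pred - y_test) ** 2)   # squared bias
    variance = np.mean(preds.var(axis=0))        # variance over data draws
    test_err = np.mean((preds - y_test) ** 2)    # equals bias2 + variance here
    print(f"p={p:4d}  bias^2={bias2:9.3f}  var={variance:9.3f}  test={test_err:9.3f}")
```

Running this, the test error should peak near p = n_train (driven by the variance, with the bias staying finite) and decrease again in the over-parameterized regime, even though the teacher is noiseless, qualitatively matching the behaviour described in the abstract.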
