Paper Title
Entropic gradient descent algorithms and wide flat minima
Paper Authors
Paper Abstract
The properties of flat minima in the empirical risk landscape of neural networks have been debated for some time. Increasing evidence suggests they possess better generalization capabilities than sharp minima. First, we discuss Gaussian mixture classification models and show analytically that there exist Bayes optimal pointwise estimators which correspond to minimizers belonging to wide flat regions. These estimators can be found by applying maximum flatness algorithms either directly to the classifier (a norm-independent procedure) or to the differentiable loss function used in learning. Next, we extend the analysis to the deep learning scenario through extensive numerical validation. Using two algorithms, Entropy-SGD and Replicated-SGD, which explicitly include in the optimization objective a non-local flatness measure known as local entropy, we consistently improve the generalization error for common architectures (e.g., ResNet, EfficientNet). An easy-to-compute flatness measure shows a clear correlation with test accuracy.
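To make the local-entropy idea concrete, below is a minimal sketch of an Entropy-SGD-style update on a toy 2-D objective. It is written in plain NumPy under assumptions of our own: the toy loss, the finite-difference gradient, and parameter names such as `gamma`, `inner_steps`, and `sgld_lr` are illustrative and not the paper's exact setup. The inner loop approximately samples from the Gibbs measure around the current weights and tracks its mean; the outer step moves the weights toward that mean, i.e., ascends the local entropy.

```python
# Hedged sketch of an Entropy-SGD-style update (assumed toy setup, not the paper's code).
import numpy as np

rng = np.random.default_rng(0)

def loss(w):
    # Toy non-convex loss in 2-D; purely illustrative.
    return 0.5 * w[0] ** 2 + np.sin(3.0 * w[1]) ** 2 + 0.1 * w[1] ** 2

def grad(w, eps=1e-5):
    # Central finite-difference gradient, to keep the sketch self-contained.
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (loss(w + e) - loss(w - e)) / (2 * eps)
    return g

def entropy_sgd_step(w, gamma=1.0, eta=0.1, inner_steps=20,
                     sgld_lr=0.05, noise=0.01, alpha=0.75):
    # Inner SGLD loop: sample w' approximately from the Gibbs measure
    # proportional to exp(-loss(w') - gamma/2 * ||w' - w||^2),
    # keeping a running (exponential) mean mu ~ <w'>.
    wp = w.copy()
    mu = w.copy()
    for _ in range(inner_steps):
        g = grad(wp) + gamma * (wp - w)
        wp = wp - sgld_lr * g + np.sqrt(sgld_lr) * noise * rng.standard_normal(w.size)
        mu = alpha * mu + (1.0 - alpha) * wp
    # Outer step: the local-entropy gradient is gamma * (<w'> - w),
    # so ascending it pulls w toward the mean of the surrounding region.
    return w - eta * gamma * (w - mu)

w = np.array([2.0, 1.0])
for _ in range(200):
    w = entropy_sgd_step(w)
print("final w:", w, "loss:", loss(w))
```

Because the outer update follows the mean of a whole neighborhood rather than the point gradient, the dynamics is biased toward wide flat regions of the loss, which is the property the abstract attributes to Entropy-SGD and Replicated-SGD.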