On the Optimization Landscape of Neural Collapse under MSE Loss: Global Optimality with Unconstrained Features

Paper Title

On the Optimization Landscape of Neural Collapse under MSE Loss: Global Optimality with Unconstrained Features

Authors

Jinxin Zhou, Xiao Li, Tianyu Ding, Chong You, Qing Qu, Zhihui Zhu

Abstract

When training deep neural networks for classification tasks, an intriguing empirical phenomenon has been widely observed in the last-layer classifiers and features, where (i) the class means and the last-layer classifiers all collapse to the vertices of a Simplex Equiangular Tight Frame (ETF) up to scaling, and (ii) cross-example within-class variability of last-layer activations collapses to zero. This phenomenon is called Neural Collapse (NC), and it seems to take place regardless of the choice of loss function. In this work, we justify NC under the mean squared error (MSE) loss, where recent empirical evidence shows that it performs comparably to or even better than the de facto cross-entropy loss. Under a simplified unconstrained feature model, we provide the first global landscape analysis for the vanilla nonconvex MSE loss and show that the (only!) global minimizers are neural collapse solutions, while all other critical points are strict saddles whose Hessians exhibit negative curvature directions. Furthermore, we justify the usage of rescaled MSE loss by probing the optimization landscape around the NC solutions, showing that the landscape can be improved by tuning the rescaling hyperparameters. Finally, our theoretical findings are experimentally verified on practical network architectures.
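
The two collapse properties named in the abstract can be illustrated numerically. The following is a minimal sketch and is not taken from the paper: it uses the standard textbook construction of a simplex ETF, and the helper names `simplex_etf` and `within_class_variability` are hypothetical. It assumes the feature dimension is at least the number of classes.

```python
import numpy as np


def simplex_etf(num_classes, dim, seed=0):
    """Return a (dim x num_classes) matrix whose columns form a simplex ETF.

    Standard construction (an assumption here, not quoted from the paper):
    M = sqrt(K / (K - 1)) * P @ (I_K - (1/K) * 1 1^T),
    where P has K orthonormal columns. Requires dim >= num_classes.
    """
    k = num_classes
    if k < 2 or dim < k:
        raise ValueError("need num_classes >= 2 and dim >= num_classes")
    rng = np.random.default_rng(seed)
    # Reduced QR of a random Gaussian matrix gives K orthonormal columns.
    p, _ = np.linalg.qr(rng.standard_normal((dim, k)))
    centering = np.eye(k) - np.ones((k, k)) / k
    return np.sqrt(k / (k - 1)) * p @ centering


def within_class_variability(features, labels):
    """Mean squared distance of each feature column to its class mean."""
    total = 0.0
    for c in np.unique(labels):
        cols = features[:, labels == c]
        total += np.sum((cols - cols.mean(axis=1, keepdims=True)) ** 2)
    return total / features.shape[1]


# Property (i): unit-norm vertices with pairwise inner product -1 / (K - 1).
etf = simplex_etf(num_classes=4, dim=10)
gram = etf.T @ etf

# Property (ii): if every feature equals its class vertex, variability is zero.
labels = np.repeat(np.arange(4), 5)
collapsed_features = etf[:, labels]
variability = within_class_variability(collapsed_features, labels)
```

In this sketch, `gram` has ones on the diagonal and `-1/3` off the diagonal for four classes, which is the equal-norm, maximally separated geometry the abstract refers to, and `variability` is zero because the features were constructed to be fully collapsed.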
