Paper Title
The Mechanism of Prediction Head in Non-contrastive Self-supervised Learning
Paper Authors
Paper Abstract
Recently, the surprising Bootstrap Your Own Latent (BYOL) method of Grill et al. showed that the negative term in the contrastive loss can be removed if the so-called prediction head is added to the network. This discovery initiated the research on non-contrastive self-supervised learning. It is mysterious why, even though trivial collapsed global optimal solutions exist, neural networks trained by (stochastic) gradient descent can still learn competitive representations. This phenomenon is a typical example of implicit bias in deep learning and remains little understood. In this work, we present our empirical and theoretical discoveries on non-contrastive self-supervised learning. Empirically, we find that when the prediction head is initialized as an identity matrix with only its off-diagonal entries being trainable, the network can learn competitive representations even though trivial optima still exist in the training objective. Theoretically, we present a framework to understand the behavior of the trainable, but identity-initialized, prediction head. Under a simple setting, we characterize the substitution effect and the acceleration effect of the prediction head. The substitution effect happens when learning the stronger features in some neurons can substitute for learning those features in other neurons through updates of the prediction head. The acceleration effect happens when the substituted features can accelerate the learning of other, weaker features and thus prevent them from being ignored. Together, these two effects enable the network to learn all the features rather than focus only on the stronger ones; such a narrow focus is likely the cause of the dimensional collapse phenomenon. To the best of our knowledge, this is also the first end-to-end optimization guarantee for non-contrastive methods using nonlinear neural networks with a trainable prediction head and normalization.
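The identity-initialized prediction head described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the paper's actual training code: the gradient `grad` is a placeholder, and the off-diagonal mask is one simple way (assumed here) to keep only the off-diagonal entries trainable while the diagonal stays fixed at its identity values.

```python
import numpy as np

d = 4  # feature dimension of the representation

# Prediction head initialized as the identity matrix.
W = np.eye(d)

# Mask with zeros on the diagonal: only off-diagonal entries receive updates.
off_diag_mask = 1.0 - np.eye(d)

def masked_step(W, grad, lr=0.1):
    """One gradient step that updates only the off-diagonal entries of W."""
    return W - lr * (grad * off_diag_mask)

# Placeholder gradient (in training this would come from the BYOL-style loss).
grad = np.ones((d, d))
W = masked_step(W, grad)
```

After the step, the diagonal of `W` is still all ones while the off-diagonal entries have moved, which is the initialization-and-trainability scheme the empirical finding refers to.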