Paper Title

Understanding Decoupled and Early Weight Decay

Paper Authors

Johan Bjorck, Kilian Weinberger, Carla Gomes

Paper Abstract

Weight decay (WD) is a traditional regularization technique in deep learning, but despite its ubiquity, its behavior is still an area of active research. Golatkar et al. have recently shown that WD only matters at the start of the training in computer vision, upending traditional wisdom. Loshchilov et al. show that for adaptive optimizers, manually decaying weights can outperform adding an $l_2$ penalty to the loss. This technique has become increasingly popular and is referred to as decoupled WD. The goal of this paper is to investigate these two recent empirical observations. We demonstrate that by applying WD only at the start, the network norm stays small throughout training. This has a regularizing effect as the effective gradient updates become larger. However, traditional generalization metrics fail to capture this effect of WD, and we show how a simple scale-invariant metric can. We also show how the growth of network weights is heavily influenced by the dataset and its generalization properties. For decoupled WD, we perform experiments in NLP and RL where adaptive optimizers are the norm. We demonstrate that the primary issue that decoupled WD alleviates is the mixing of gradients from the objective function and the $l_2$ penalty in the buffers of Adam (which stores the estimates of the first-order moment). Adaptivity itself is not problematic and decoupled WD ensures that the gradients from the $l_2$ term cannot "drown out" the true objective, facilitating easier hyperparameter tuning.
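
To make the distinction in the abstract concrete, the sketch below contrasts a coupled $l_2$ penalty with decoupled WD inside a single Adam-style update. This is a minimal illustration written for this summary, not code from the paper; the helper `adam_step` and its default hyperparameters are assumptions chosen for readability. With the coupled penalty, the decay term `wd * w` is added to the gradient and therefore flows into Adam's moment buffers, where a large penalty can drown out the gradient of the true objective; with decoupled WD (AdamW-style), the weights are shrunk directly and the buffers only ever see gradients of the objective.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, wd=1e-2, decoupled=False):
    """One Adam update on parameter vector w (hypothetical helper for illustration)."""
    if not decoupled:
        # Coupled l2 penalty: the decay term is folded into the gradient,
        # so it gets mixed into the moment buffers m and v below.
        grad = grad + wd * w
    # Exponential moving averages of the first and second moments.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias-corrected moment estimates.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Adaptive update driven by the objective (and, if coupled, the penalty).
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    if decoupled:
        # Decoupled WD: shrink the weights directly, bypassing the
        # adaptive buffers entirely.
        w = w - lr * wd * w
    return w, m, v
```

In practice this is the difference between passing `weight_decay` to `torch.optim.Adam` (coupled, added to the gradient) and to `torch.optim.AdamW` (decoupled, applied directly to the weights).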
