Paper Title

On Layer Normalization in the Transformer Architecture

Authors

Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, Tie-Yan Liu

Abstract

The Transformer is widely used in natural language processing tasks. To train a Transformer however, one usually needs a carefully designed learning rate warm-up stage, which is shown to be crucial to the final performance but will slow down the optimization and bring more hyper-parameter tunings. In this paper, we first study theoretically why the learning rate warm-up stage is essential and show that the location of layer normalization matters. Specifically, we prove with mean field theory that at initialization, for the original-designed Post-LN Transformer, which places the layer normalization between the residual blocks, the expected gradients of the parameters near the output layer are large. Therefore, using a large learning rate on those gradients makes the training unstable. The warm-up stage is practically helpful for avoiding this problem. On the other hand, our theory also shows that if the layer normalization is put inside the residual blocks (recently proposed as Pre-LN Transformer), the gradients are well-behaved at initialization. This motivates us to remove the warm-up stage for the training of Pre-LN Transformers. We show in our experiments that Pre-LN Transformers without the warm-up stage can reach comparable results with baselines while requiring significantly less training time and hyper-parameter tuning on a wide range of applications.
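
The architectural distinction the abstract draws is where layer normalization sits relative to the residual connection: Post-LN applies it after the residual addition, while Pre-LN applies it inside the residual branch. Below is a minimal PyTorch sketch of the two placements; the class names `PostLNBlock` and `PreLNBlock` and the generic `sublayer` argument are illustrative assumptions, not code from the paper.

```python
import torch
import torch.nn as nn


class PostLNBlock(nn.Module):
    """Post-LN (original Transformer): LayerNorm is applied after the residual addition."""

    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize the sum of the input and the sub-layer output.
        return self.norm(x + self.sublayer(x))


class PreLNBlock(nn.Module):
    """Pre-LN: LayerNorm is applied inside the residual branch, before the sub-layer."""

    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The identity path carries x unnormalized; only the branch is normalized.
        return x + self.sublayer(self.norm(x))


if __name__ == "__main__":
    # Example usage with a feed-forward sub-layer (illustrative dimensions).
    ff = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
    block = PreLNBlock(512, ff)
    y = block(torch.randn(8, 16, 512))  # (batch, seq_len, d_model)
    print(y.shape)
```

In the Pre-LN variant the identity path passes the input through unchanged, which is consistent with the abstract's claim that gradients near the output layer are better behaved at initialization, allowing the warm-up stage to be removed.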
