Paper Title
Effects of Parameter Norm Growth During Transformer Training: Inductive Bias from Gradient Descent
Paper Authors
Paper Abstract
The capacity of neural networks like the widely adopted transformer is known to be very high. Evidence is emerging that they learn successfully due to inductive bias in the training routine, typically a variant of gradient descent (GD). To better understand this bias, we study the tendency for transformer parameters to grow in magnitude ($\ell_2$ norm) during training, and its implications for the emergent representations within self-attention layers. Empirically, we document norm growth in the training of transformer language models, including T5 during its pretraining. As the parameters grow in magnitude, we prove that the network approximates a discretized network with saturated activation functions. Such "saturated" networks are known to have a reduced capacity compared to the full network family that can be described in terms of formal languages and automata. Our results suggest saturation is a new characterization of an inductive bias implicit in GD of particular interest for NLP. We leverage the emergent discrete structure in a saturated transformer to analyze the role of different attention heads, finding that some focus locally on a small number of positions, while other heads compute global averages, allowing counting. We believe understanding the interplay between these two capabilities may shed further light on the structure of computation within large transformers.
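The saturation effect described in the abstract can be illustrated with a minimal NumPy sketch (not from the paper): scaling the query/key parameters of a dot-product attention head by a growing constant `c` drives the softmax weights toward a hard, argmax-concentrated distribution, which is the "saturated" limit. The function names and the toy data below are illustrative assumptions, not the authors' code.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention_weights(q, K, scale=1.0):
    """Attention weights for one query when all parameters are scaled by `scale`.

    Scaling the parameters scales both the query and key projections,
    so the attention scores grow quadratically in `scale`.
    """
    scores = (scale * q) @ (scale * K).T
    return softmax(scores)

rng = np.random.default_rng(0)
q = rng.normal(size=4)        # toy query vector
K = rng.normal(size=(6, 4))   # toy key matrix (6 positions)

for c in [1.0, 2.0, 10.0]:
    print(c, np.round(attention_weights(q, K, scale=c), 3))
# As c grows, the weights concentrate on the argmax-scoring position(s),
# approximating the discrete "saturated" attention discussed in the abstract.
```

Under this sketch's assumptions, a head whose saturated weights land on one or two positions corresponds to the "local" heads mentioned in the abstract, while a head whose scores tie across many positions averages uniformly over them, enabling counting-like behavior.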