Title
Signal Propagation in Transformers: Theoretical Perspectives and the Role of Rank Collapse
Authors
Abstract
Transformers have achieved remarkable success in several domains, ranging from natural language processing to computer vision. Nevertheless, it has recently been shown that stacking self-attention layers, the distinctive architectural component of Transformers, can result in rank collapse of the tokens' representations at initialization. The question of whether and how rank collapse affects training is still largely unanswered, and its investigation is necessary for a more comprehensive understanding of this architecture. In this work, we shed new light on the causes and effects of this phenomenon. First, we show that rank collapse of the tokens' representations hinders training by causing the gradients of the queries and keys to vanish at initialization. Furthermore, we provide a thorough description of the origin of rank collapse and discuss how to prevent it via an appropriate depth-dependent scaling of the residual branches. Finally, our analysis unveils that specific architectural hyperparameters affect the gradients of queries and values differently, leading to disproportionate gradient norms. This suggests an explanation for the widespread use of adaptive methods in Transformer optimization.
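The following is a minimal numerical sketch, not the authors' code, of the two phenomena the abstract describes: stacked self-attention at random initialization drives the token representations toward a rank-one matrix, and a depth-dependent scaling of the residual branch mitigates this. All sizes, the Gaussian initialization, and the 1/sqrt(depth) choice of residual scaling are illustrative assumptions.

```python
# Illustrative sketch of rank collapse in stacked self-attention (assumptions:
# single-head attention, Gaussian init, no MLP or LayerNorm, toy dimensions).
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, depth = 32, 64, 24

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(X):
    # Single-head self-attention with freshly sampled Gaussian weights,
    # mimicking one Transformer layer at initialization.
    Wq, Wk, Wv = (rng.normal(0.0, d_model ** -0.5, (d_model, d_model))
                  for _ in range(3))
    A = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(d_model))
    return A @ (X @ Wv)

def rank1_residual(X):
    # Relative distance of X from its best rank-one approximation;
    # values near zero are the signature of rank collapse.
    s = np.linalg.svd(X, compute_uv=False)
    return np.sqrt((s[1:] ** 2).sum() / (s ** 2).sum())

for label, scale in [("pure attention", None),
                     ("residual branch scaled by 1/sqrt(depth)", depth ** -0.5)]:
    X = rng.normal(size=(n_tokens, d_model))
    for _ in range(depth):
        Y = attention(X)
        X = Y if scale is None else X + scale * Y
    print(f"{label}: rank-1 residual after {depth} layers = {rank1_residual(X):.3e}")
```

Running the sketch, the pure-attention stack drives the rank-1 residual toward zero within a handful of layers, while the scaled residual connection keeps the representations well away from rank one, consistent with the prevention mechanism the abstract refers to.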