Paper Title
On the Computational Power of Transformers and its Implications in Sequence Modeling
Paper Authors
Paper Abstract
Transformers are used extensively across several sequence modeling tasks. Significant research effort has been devoted to experimentally probing the inner workings of Transformers. However, our conceptual and theoretical understanding of their power and inherent limitations is still nascent. In particular, the roles of various components in Transformers, such as positional encodings, attention heads, residual connections, and feedforward networks, are not clear. In this paper, we take a step towards answering these questions. We analyze the computational power of Transformers as captured by Turing-completeness. We first provide an alternative, simpler proof that vanilla Transformers are Turing-complete, and then prove that Transformers with only positional masking and no positional encoding are also Turing-complete. We further analyze the necessity of each component for the Turing-completeness of the network; interestingly, we find that a particular type of residual connection is necessary. We demonstrate the practical implications of our results via experiments on machine translation and synthetic tasks.
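
To make the contrast in the abstract concrete, below is a minimal NumPy sketch (not taken from the paper; all function names are illustrative, and learned query/key/value projections are omitted for brevity) of the two mechanisms being compared: adding sinusoidal positional encodings to the token embeddings versus using positional (causal) masking inside self-attention, where position i may attend only to positions j <= i and no positional information is added to the input.

import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # Standard sinusoidal encodings, added to token embeddings.
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                 # (1, d_model)
    angles = positions / np.power(10000.0, (2 * (dims // 2)) / d_model)
    # Even dimensions get sine, odd dimensions get cosine.
    return np.where(dims % 2 == 0, np.sin(angles), np.cos(angles))

def positionally_masked_attention(x):
    # Single-head self-attention with a causal mask and no positional
    # encoding: position i attends only to positions j <= i.
    seq_len, d_model = x.shape
    scores = x @ x.T / np.sqrt(d_model)                # (seq_len, seq_len)
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # j > i
    scores = np.where(mask, -np.inf, scores)           # block future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ x                                 # (seq_len, d_model)

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))
y_masked = positionally_masked_attention(x)            # masking only
y_encoded = x + sinusoidal_positional_encoding(5, 8)   # encoding variant
print(y_masked.shape, y_encoded.shape)                 # (5, 8) (5, 8)

The paper's theoretical claim is that the masking route alone already suffices for Turing-completeness; the sketch above only illustrates where each mechanism enters the computation, not the constructions used in the proofs.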