Paper Title

Synthesizer: Rethinking Self-Attention in Transformer Models

Paper Authors

Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, Che Zheng

Paper Abstract

The dot product self-attention is known to be central and indispensable to state-of-the-art Transformer models. But is it really required? This paper investigates the true importance and contribution of the dot product-based self-attention mechanism on the performance of Transformer models. Via extensive experiments, we find that (1) random alignment matrices surprisingly perform quite competitively and (2) learning attention weights from token-token (query-key) interactions is useful but not that important after all. To this end, we propose \textsc{Synthesizer}, a model that learns synthetic attention weights without token-token interactions. In our experiments, we first show that simple Synthesizers achieve highly competitive performance when compared against vanilla Transformer models across a range of tasks, including machine translation, language modeling, text generation and GLUE/SuperGLUE benchmarks. When composed with dot product attention, we find that Synthesizers consistently outperform Transformers. Moreover, we conduct additional comparisons of Synthesizers against Dynamic Convolutions, showing that simple Random Synthesizer is not only $60\%$ faster but also improves perplexity by a relative $3.5\%$. Finally, we show that simple factorized Synthesizers can outperform Linformers on encoding only tasks.
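To make the idea in the abstract concrete, below is a minimal sketch of two synthetic attention variants: a "dense" variant that predicts each token's attention row from that token's representation alone, and a "random" variant whose attention matrix is a learned (or optionally fixed random) parameter independent of the input. The class names, layer sizes, and single-head formulation are illustrative assumptions for exposition, not the authors' reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseSynthesizerAttention(nn.Module):
    """Sketch of a 'dense' synthetic attention head: each token predicts its own
    row of the (length x length) attention matrix from its representation alone,
    so no query-key dot product is computed."""

    def __init__(self, d_model: int, max_len: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, max_len),  # one logit per attended position
        )
        self.value = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, d_model); assumes length <= max_len
        length = x.size(1)
        logits = self.proj(x)[:, :, :length]   # (batch, length, length)
        weights = F.softmax(logits, dim=-1)
        return weights @ self.value(x)          # (batch, length, d_model)


class RandomSynthesizerAttention(nn.Module):
    """Sketch of a 'random' synthetic attention head: the attention matrix is a
    parameter shared across all inputs, either trained or kept as fixed noise."""

    def __init__(self, d_model: int, max_len: int, trainable: bool = True):
        super().__init__()
        self.attn_logits = nn.Parameter(
            torch.randn(max_len, max_len), requires_grad=trainable
        )
        self.value = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        length = x.size(1)
        weights = F.softmax(self.attn_logits[:length, :length], dim=-1)
        return weights @ self.value(x)          # (batch, length, d_model)
```

In both sketches the usual query-key interaction is absent: the dense variant conditions the attention weights on each token individually, while the random variant's only input-dependent computation is the value projection, which is what makes it cheaper than standard dot product self-attention.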
