Paper Title
Mega: Moving Average Equipped Gated Attention
Paper Authors
Xuezhe Ma, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan May, Luke Zettlemoyer
Paper Abstract
The design choices in the Transformer attention mechanism, including weak inductive bias and quadratic computational complexity, have limited its application for modeling long sequences. In this paper, we introduce Mega, a simple, theoretically grounded, single-head gated attention mechanism equipped with (exponential) moving average to incorporate inductive bias of position-aware local dependencies into the position-agnostic attention mechanism. We further propose a variant of Mega that offers linear time and space complexity yet yields only minimal quality loss, by efficiently splitting the whole sequence into multiple chunks with fixed length. Extensive experiments on a wide range of sequence modeling benchmarks, including the Long Range Arena, neural machine translation, auto-regressive language modeling, and image and speech classification, show that Mega achieves significant improvements over other sequence models, including variants of Transformers and recent state space models.
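To make the moving-average component concrete, below is a minimal sketch of a damped exponential moving average over a sequence, assuming per-dimension scalar smoothing and damping parameters. The function name and parameter names are illustrative; the paper's actual component is a multi-dimensional damped EMA with a learned expansion and projection, which this sketch omits.

```python
import numpy as np

def damped_ema(x, alpha, delta):
    """Damped exponential moving average along the time axis.

    x:     (seq_len, d) input sequence
    alpha: (d,) per-dimension smoothing weights in (0, 1)
    delta: (d,) per-dimension damping factors in (0, 1)

    Recurrence per dimension: h_t = alpha * x_t + (1 - alpha * delta) * h_{t-1}
    """
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = alpha * x[t] + (1.0 - alpha * delta) * h
        out[t] = h
    return out

# Example: smooth a length-8 sequence with 4 feature dimensions.
x = np.random.randn(8, 4)
smoothed = damped_ema(x, alpha=np.full(4, 0.5), delta=np.full(4, 0.9))
print(smoothed.shape)  # (8, 4)
```

Because the recurrence mixes each position with its recent past, the EMA output is position-aware and biased toward local dependencies, which is the inductive bias the abstract says is injected into the otherwise position-agnostic attention.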
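The linear-complexity variant rests on splitting the sequence into non-overlapping fixed-length chunks and attending within each chunk. The sketch below shows only that chunking idea, assuming plain single-head softmax attention and a sequence length divisible by the chunk size; it omits gating, the EMA component (which in the full model helps carry information across chunk boundaries), and all other model details, and every name here is illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def chunked_attention(q, k, v, chunk_size):
    """Single-head attention restricted to fixed-length chunks.

    q, k, v: (seq_len, d); seq_len is assumed divisible by chunk_size.
    Each chunk attends only within itself, so the total cost is
    O(seq_len * chunk_size) rather than O(seq_len ** 2).
    """
    n, d = q.shape
    out = np.empty_like(v)
    for start in range(0, n, chunk_size):
        s = slice(start, start + chunk_size)
        scores = q[s] @ k[s].T / np.sqrt(d)  # (chunk, chunk) scores
        out[s] = softmax(scores) @ v[s]
    return out

# Example: a length-16 sequence with chunks of 4 -> four 4x4 attention maps.
q = k = v = np.random.randn(16, 8)
y = chunked_attention(q, k, v, chunk_size=4)
print(y.shape)  # (16, 8)
```

With a fixed chunk size, both time and memory grow linearly in sequence length, matching the complexity claim in the abstract.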