Paper Title
Mega: Moving Average Equipped Gated Attention
Paper Authors
Xuezhe Ma, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan May, Luke Zettlemoyer
Paper Abstract
The design choices in the Transformer attention mechanism, including weak inductive bias and quadratic computational complexity, have limited its application for modeling long sequences. In this paper, we introduce Mega, a simple, theoretically grounded, single-head gated attention mechanism equipped with (exponential) moving average to incorporate inductive bias of position-aware local dependencies into the position-agnostic attention mechanism. We further propose a variant of Mega that offers linear time and space complexity yet yields only minimal quality loss, by efficiently splitting the whole sequence into multiple chunks with fixed length. Extensive experiments on a wide range of sequence modeling benchmarks, including the Long Range Arena, neural machine translation, auto-regressive language modeling, and image and speech classification, show that Mega achieves significant improvements over other sequence models, including variants of Transformers and recent state space models.
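To make the moving-average component concrete, below is a minimal sketch of a damped exponential moving average over a sequence, assuming per-dimension scalar smoothing and damping parameters. The function name and parameter names are illustrative; the paper's actual component is a multi-dimensional damped EMA with a learned expansion and projection, which this sketch omits.

```python
import numpy as np

def damped_ema(x, alpha, delta):
    """Damped exponential moving average along the time axis.

    x:     (seq_len, d) input sequence
    alpha: (d,) per-dimension smoothing weights in (0, 1)
    delta: (d,) per-dimension damping factors in (0, 1)

    Recurrence per dimension: h_t = alpha * x_t + (1 - alpha * delta) * h_{t-1}
    """
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = alpha * x[t] + (1.0 - alpha * delta) * h
        out[t] = h
    return out

# Example: smooth a length-8 sequence with 4 feature dimensions.
x = np.random.randn(8, 4)
smoothed = damped_ema(x, alpha=np.full(4, 0.5), delta=np.full(4, 0.9))
print(smoothed.shape)  # (8, 4)
```

Because the recurrence mixes each position with its recent past, the EMA output is position-aware and biased toward local dependencies, which is the inductive bias the abstract says is injected into the otherwise position-agnostic attention.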
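The linear-complexity variant rests on splitting the sequence into non-overlapping fixed-length chunks and attending within each chunk. The sketch below shows only that chunking idea, assuming plain single-head softmax attention and a sequence length divisible by the chunk size; it omits gating, the EMA component (which in the full model helps carry information across chunk boundaries), and all other model details, and every name here is illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def chunked_attention(q, k, v, chunk_size):
    """Single-head attention restricted to fixed-length chunks.

    q, k, v: (seq_len, d); seq_len is assumed divisible by chunk_size.
    Each chunk attends only within itself, so the total cost is
    O(seq_len * chunk_size) rather than O(seq_len ** 2).
    """
    n, d = q.shape
    out = np.empty_like(v)
    for start in range(0, n, chunk_size):
        s = slice(start, start + chunk_size)
        scores = q[s] @ k[s].T / np.sqrt(d)  # (chunk, chunk) scores
        out[s] = softmax(scores) @ v[s]
    return out

# Example: a length-16 sequence with chunks of 4 -> four 4x4 attention maps.
q = k = v = np.random.randn(16, 8)
y = chunked_attention(q, k, v, chunk_size=4)
print(y.shape)  # (16, 8)
```

With a fixed chunk size, both time and memory grow linearly in sequence length, matching the complexity claim in the abstract.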