Paper Title
Diffuser: Efficient Transformers with Multi-hop Attention Diffusion for Long Sequences
Paper Authors
Paper Abstract
Efficient Transformers have been developed for long-sequence modeling, owing to their subquadratic memory and time complexity. Sparse Transformers are a popular approach that improves the efficiency of Transformers by restricting self-attention to locations specified by predefined sparse patterns. However, leveraging sparsity may sacrifice expressiveness compared to full attention when important token correlations lie multiple hops away. To combine the efficiency of sparse Transformers with the expressiveness of full-attention Transformers, we propose \textit{Diffuser}, a new state-of-the-art efficient Transformer. Diffuser incorporates all token interactions within one attention layer while maintaining low computation and memory cost. The key idea is to expand the receptive field of sparse attention using Attention Diffusion, which computes multi-hop token correlations based on all paths between the corresponding disconnected tokens, in addition to attention among neighboring tokens. Theoretically, we show the expressiveness of Diffuser as a universal sequence approximator for sequence-to-sequence modeling, and investigate its ability to approximate full attention by analyzing the graph expander property from the spectral perspective. Experimentally, we investigate the effectiveness of Diffuser with extensive evaluations, including language modeling, image modeling, and the Long Range Arena (LRA). Evaluation results show that Diffuser achieves improvements of 0.94% on average on text classification tasks and 2.30% on LRA, with 1.67$\times$ memory savings compared to state-of-the-art benchmarks, demonstrating the superior performance of Diffuser in both expressiveness and efficiency.
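To make the attention-diffusion idea concrete, below is a minimal NumPy sketch of the mechanism described in the abstract: attention is first restricted to a predefined sparse pattern, and its receptive field is then expanded by iterating a diffusion update so that each token aggregates correlations along multi-hop paths in the sparse attention graph. The function names (`sparse_attention_scores`, `attention_diffusion`) and the parameters `num_hops` and `alpha` are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def sparse_attention_scores(q, k, mask):
    """Masked softmax attention restricted to a predefined sparse pattern.

    q, k: (n, d) query/key matrices; mask: (n, n) boolean sparsity pattern.
    Assumes every token attends at least to itself (diagonal in mask).
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)           # keep only allowed token pairs
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def attention_diffusion(A, v, num_hops=3, alpha=0.1):
    """Hypothetical multi-hop diffusion over the sparse attention matrix A.

    Each step mixes the one-hop sparse attention output with the original
    values, so after num_hops steps a token has aggregated information along
    all paths of length <= num_hops in the sparse attention graph.
    """
    out = v.copy()
    for _ in range(num_hops):
        out = (1 - alpha) * (A @ out) + alpha * v      # illustrative diffusion step
    return out

# Usage sketch: a sliding-window sparse pattern of width 1 plus the diagonal.
n, d = 8, 16
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
mask = np.abs(np.subtract.outer(np.arange(n), np.arange(n))) <= 1
A = sparse_attention_scores(q, k, mask)
out = attention_diffusion(A, v, num_hops=3, alpha=0.1)   # (n, d) diffused values
```

The diffusion loop keeps the per-step cost proportional to the number of nonzeros in the sparse pattern, which is how a sketch like this retains subquadratic cost while still letting distant tokens interact through multi-hop paths.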