Paper Title
Sparse Sinkhorn Attention
Paper Authors
Paper Abstract
We propose Sparse Sinkhorn Attention, a new efficient and sparse method for learning to attend. Our method is based on differentiable sorting of internal representations. Concretely, we introduce a meta sorting network that learns to generate latent permutations over sequences. Given sorted sequences, we are then able to compute quasi-global attention with only local windows, improving the memory efficiency of the attention module. To this end, we propose new algorithmic innovations such as Causal Sinkhorn Balancing and SortCut, a dynamic sequence truncation method for tailoring Sinkhorn Attention for encoding and/or decoding purposes. Via extensive experiments on algorithmic seq2seq sorting, language modeling, pixel-wise image generation, document classification and natural language inference, we demonstrate that our memory efficient Sinkhorn Attention method is competitive with vanilla attention and consistently outperforms recently proposed efficient Transformer models such as Sparse Transformers.
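The abstract only sketches the core mechanism: a meta sorting network scores an alignment between sequence blocks, Sinkhorn normalization relaxes those scores into a (soft) permutation matrix, and attention is then computed within local windows of the permuted sequence. Below is a minimal illustrative sketch of that Sinkhorn step, assuming PyTorch; the block splitting, iteration count, and the stand-in for the meta network's scores are assumptions for illustration, not the authors' implementation.

```python
import torch

def sinkhorn_normalize(log_scores, n_iters=8):
    """Relax block-to-block alignment scores into a soft permutation.

    log_scores: (n_blocks, n_blocks) unnormalized log-alignment scores,
    here standing in for the output of a meta sorting network. Alternating
    row and column normalization in log space pushes exp(log_scores)
    toward a doubly-stochastic matrix, i.e. a relaxed permutation.
    """
    for _ in range(n_iters):
        log_scores = log_scores - torch.logsumexp(log_scores, dim=1, keepdim=True)  # row normalize
        log_scores = log_scores - torch.logsumexp(log_scores, dim=0, keepdim=True)  # column normalize
    return torch.exp(log_scores)

# Usage sketch: permute blocks of a sequence before windowed (local) attention.
n_blocks, block_len, dim = 4, 8, 16
x = torch.randn(n_blocks, block_len, dim)        # sequence split into blocks
scores = torch.randn(n_blocks, n_blocks)         # stand-in for the meta sorting network's output
perm = sinkhorn_normalize(scores)                # soft permutation over blocks
x_sorted = torch.einsum('ij,jld->ild', perm, x)  # rearranged blocks; local attention would follow
```

Because attention is restricted to local windows of the permuted sequence, memory scales with the window size rather than the full sequence length, which is the source of the efficiency gain the abstract claims.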