Paper Title
DBA: Efficient Transformer with Dynamic Bilinear Low-Rank Attention
Paper Authors
Paper Abstract
Many studies have been conducted to improve the efficiency of the Transformer from quadratic to linear complexity. Among them, low-rank-based methods aim to learn projection matrices that compress the sequence length. However, once learned, these projection matrices are fixed, compressing the sequence length with the same coefficients for tokens at the same positions regardless of the input. Adopting such input-invariant projections ignores the fact that the most informative part of a sequence varies from sequence to sequence, thus failing to preserve the most useful information, which lies at varied positions. In addition, previous efficient Transformers focus only on the influence of the sequence length while neglecting the effect of the hidden state dimension. To address these problems, we present an efficient yet effective attention mechanism, namely Dynamic Bilinear Low-Rank Attention (DBA), which compresses the sequence length with input-sensitive dynamic projection matrices and achieves linear time and space complexity by jointly optimizing over the sequence length and the hidden state dimension, while maintaining state-of-the-art performance. Specifically, we first demonstrate theoretically, from a novel information-theoretic perspective, that the sequence length can be compressed non-destructively, with the compression matrices dynamically determined by the input sequence. Furthermore, we show that the hidden state dimension can be approximated by extending the Johnson-Lindenstrauss lemma, which optimizes the attention in a bilinear form. Theoretical analysis shows that DBA is proficient in capturing high-order relations in cross-attention problems. Experiments on tasks with diverse sequence-length conditions show that DBA achieves state-of-the-art performance compared with various strong baselines while consuming less memory and running at higher speed.
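The abstract describes two compressions applied jointly: an input-dependent projection that shrinks the sequence length, and a Johnson-Lindenstrauss-style reduction of the hidden dimension before the attention scores are formed, yielding a bilinear low-rank attention. Below is a minimal single-head sketch of that idea; the function name dba_attention, the softmax parameterization of the dynamic projection, and the parameters w_proj and r_jl are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def dba_attention(q, k, v, w_proj, r_jl):
    """Sketch of dynamic bilinear low-rank attention (single head).

    q, k, v: (n, d) query/key/value sequences.
    w_proj:  (d, m) learned weights producing an input-dependent projection
             that compresses sequence length n -> m (m << n).
    r_jl:    (d, r) projection reducing hidden dimension d -> r (r << d),
             in the spirit of the Johnson-Lindenstrauss lemma.
    """
    # Input-sensitive compression matrix: it is computed from the keys
    # themselves, so different sequences can emphasize different positions,
    # unlike a fixed, input-invariant projection.
    p = F.softmax(k @ w_proj, dim=0).transpose(0, 1)   # (m, n)
    k_c = p @ k                                        # (m, d) compressed keys
    v_c = p @ v                                        # (m, d) compressed values

    # Reduce the hidden dimension before computing attention scores,
    # giving the bilinear form: scores depend on both compressions.
    q_r = q @ r_jl                                     # (n, r)
    k_r = k_c @ r_jl                                   # (m, r)
    scores = F.softmax(q_r @ k_r.transpose(0, 1) / r_jl.shape[1] ** 0.5, dim=-1)  # (n, m)
    return scores @ v_c                                # (n, d)

# Usage: cost is O(n*m*r + n*m*d), i.e. linear in n for fixed m and r.
n, d, m, r = 1024, 64, 32, 16
x = torch.randn(n, d)
w_proj = torch.randn(d, m) / d ** 0.5
r_jl = torch.randn(d, r) / r ** 0.5
out = dba_attention(x, x, x, w_proj, r_jl)   # shape (1024, 64)
```

Note that both reductions enter the score computation, which is where the linear complexity in the abstract comes from: no n-by-n attention matrix is ever materialized, only an n-by-m one over the compressed positions.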