Paper Title
Sparsifying Transformer Models with Trainable Representation Pooling
Paper Authors
Paper Abstract
We propose a novel method to sparsify attention in the Transformer model by learning to select the most-informative token representations during the training process, thus focusing on the task-specific parts of an input. A reduction of quadratic time and memory complexity to sublinear was achieved due to a robust trainable top-$k$ operator. Our experiments on a challenging long document summarization task show that even our simple baseline performs comparably to the current SOTA, and with trainable pooling, we can retain its top quality, while being $1.8\times$ faster during training, $4.5\times$ faster during inference, and up to $13\times$ more computationally efficient in the decoder.
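Below is a minimal sketch of the kind of trainable top-$k$ representation pooling the abstract describes: a learned scorer ranks token representations, the $k$ highest-scoring ones are kept, and scaling the kept vectors by their (differentiable) scores lets gradients flow back to the scorer. This is an illustrative approximation written in PyTorch, not the paper's exact operator; the class name `TopKPooling` and all hyperparameters are assumptions for the example.

```python
# Illustrative sketch of trainable top-k representation pooling (assumed PyTorch).
# Not the paper's exact operator: a linear scorer ranks tokens, the top-k are
# gathered, and multiplying by the sigmoid scores keeps the scorer trainable.
import torch
import torch.nn as nn


class TopKPooling(nn.Module):
    def __init__(self, d_model: int, k: int):
        super().__init__()
        self.k = k
        self.scorer = nn.Linear(d_model, 1)  # learns which tokens are informative

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        scores = self.scorer(x).squeeze(-1)           # (batch, seq_len)
        weights = torch.sigmoid(scores)               # keep-probabilities in (0, 1)
        topk = torch.topk(weights, self.k, dim=1)     # hard selection of k tokens
        idx = topk.indices.unsqueeze(-1).expand(-1, -1, x.size(-1))
        selected = torch.gather(x, 1, idx)            # (batch, k, d_model)
        # scaling by the differentiable weights propagates gradients to the scorer
        return selected * topk.values.unsqueeze(-1)


if __name__ == "__main__":
    pool = TopKPooling(d_model=512, k=64)
    tokens = torch.randn(2, 4096, 512)                # a long input sequence
    pooled = pool(tokens)
    print(pooled.shape)                               # torch.Size([2, 64, 512])
```

Because downstream attention layers would operate on only $k$ pooled representations instead of the full sequence, the attention cost no longer grows quadratically with the input length, which is the efficiency effect the abstract reports.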