Paper Title
Sparsifying Transformer Models with Trainable Representation Pooling
Paper Authors
Paper Abstract
We propose a novel method to sparsify attention in the Transformer model by learning to select the most-informative token representations during the training process, thus focusing on the task-specific parts of an input. A reduction of quadratic time and memory complexity to sublinear was achieved due to a robust trainable top-$k$ operator. Our experiments on a challenging long document summarization task show that even our simple baseline performs comparably to the current SOTA, and with trainable pooling, we can retain its top quality, while being $1.8\times$ faster during training, $4.5\times$ faster during inference, and up to $13\times$ more computationally efficient in the decoder.
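Below is a minimal sketch of the kind of trainable top-$k$ representation pooling the abstract describes: a learned scorer ranks token representations, the $k$ highest-scoring ones are kept, and scaling the kept vectors by their (differentiable) scores lets gradients flow back to the scorer. This is an illustrative approximation written in PyTorch, not the paper's exact operator; the class name `TopKPooling` and all hyperparameters are assumptions for the example.

```python
# Illustrative sketch of trainable top-k representation pooling (assumed PyTorch).
# Not the paper's exact operator: a linear scorer ranks tokens, the top-k are
# gathered, and multiplying by the sigmoid scores keeps the scorer trainable.
import torch
import torch.nn as nn


class TopKPooling(nn.Module):
    def __init__(self, d_model: int, k: int):
        super().__init__()
        self.k = k
        self.scorer = nn.Linear(d_model, 1)  # learns which tokens are informative

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        scores = self.scorer(x).squeeze(-1)           # (batch, seq_len)
        weights = torch.sigmoid(scores)               # keep-probabilities in (0, 1)
        topk = torch.topk(weights, self.k, dim=1)     # hard selection of k tokens
        idx = topk.indices.unsqueeze(-1).expand(-1, -1, x.size(-1))
        selected = torch.gather(x, 1, idx)            # (batch, k, d_model)
        # scaling by the differentiable weights propagates gradients to the scorer
        return selected * topk.values.unsqueeze(-1)


if __name__ == "__main__":
    pool = TopKPooling(d_model=512, k=64)
    tokens = torch.randn(2, 4096, 512)                # a long input sequence
    pooled = pool(tokens)
    print(pooled.shape)                               # torch.Size([2, 64, 512])
```

Because downstream attention layers would operate on only $k$ pooled representations instead of the full sequence, the attention cost no longer grows quadratically with the input length, which is the efficiency effect the abstract reports.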