Paper Title
DCT-Former: Efficient Self-Attention with Discrete Cosine Transform
Paper Authors
Paper Abstract
Since their introduction, Transformer architectures have emerged as the dominant architectures for natural language processing and, more recently, computer vision applications. An intrinsic limitation of this family of "fully-attentive" architectures arises from the computation of the dot-product attention, which grows in both memory consumption and number of operations as $O(n^2)$, where $n$ stands for the input sequence length, thus limiting applications that require modeling very long sequences. Several approaches have been proposed in the literature to mitigate this issue, with varying degrees of success. Our idea takes inspiration from the world of lossy data compression (such as the JPEG algorithm) to derive an approximation of the attention module by leveraging the properties of the Discrete Cosine Transform. An extensive experimental section shows that our method requires less memory for the same performance, while also drastically reducing inference time. This makes it particularly suitable for real-time applications on embedded platforms. Moreover, we believe that the results of our research might serve as a starting point for a broader family of deep neural models with a reduced memory footprint. The implementation will be made publicly available at https://github.com/cscribano/DCT-Former-Public
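
To illustrate the general idea, the sketch below shows one possible way a DCT-based approximation of attention could be structured: keys and values are compressed along the sequence axis by keeping only the first $m$ low-frequency DCT-II coefficients, so the score matrix shrinks from $n \times n$ to $n \times m$. This is a minimal, hedged illustration of the concept described in the abstract; the function names, the choice of compressing only K and V, and the truncation scheme are assumptions for exposition and are not taken from the paper's actual implementation (see the linked repository for that).

```python
import math
import torch

def dct_matrix(n: int, m: int) -> torch.Tensor:
    """First m rows of the orthonormal DCT-II matrix of size n (illustrative helper)."""
    i = torch.arange(n, dtype=torch.float32)
    k = torch.arange(m, dtype=torch.float32).unsqueeze(1)
    C = torch.cos(math.pi / n * (i + 0.5) * k)   # shape (m, n)
    C[0] *= math.sqrt(1.0 / n)
    C[1:] *= math.sqrt(2.0 / n)
    return C

def dct_truncated_attention(q, k, v, m):
    """Hypothetical attention approximation: compress K and V along the sequence
    axis by keeping only the first m DCT coefficients, so the score matrix is
    (n x m) instead of (n x n).  q, k, v have shape (batch, n, d)."""
    n, d = k.shape[1], k.shape[2]
    C = dct_matrix(n, m).to(k.dtype)                  # (m, n) truncated DCT basis
    k_c = torch.einsum("mn,bnd->bmd", C, k)           # compressed keys   (batch, m, d)
    v_c = torch.einsum("mn,bnd->bmd", C, v)           # compressed values (batch, m, d)
    scores = q @ k_c.transpose(1, 2) / math.sqrt(d)   # (batch, n, m)
    return torch.softmax(scores, dim=-1) @ v_c        # (batch, n, d)

# Usage: a sequence of n = 1024 tokens approximated with m = 128 DCT coefficients.
q = torch.randn(2, 1024, 64)
k = torch.randn(2, 1024, 64)
v = torch.randn(2, 1024, 64)
out = dct_truncated_attention(q, k, v, m=128)
print(out.shape)  # torch.Size([2, 1024, 64])
```

Under these assumptions the memory and compute cost of the score matrix scales as $O(n \cdot m)$ rather than $O(n^2)$, with $m \ll n$ acting as a lossy-compression knob analogous to discarding high-frequency coefficients in JPEG.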