Paper Title

The Lipschitz Constant of Self-Attention

Authors

Hyunjik Kim, George Papamakarios, Andriy Mnih

Abstract

Lipschitz constants of neural networks have been explored in various contexts in deep learning, such as provable adversarial robustness, estimating Wasserstein distance, stabilising training of GANs, and formulating invertible neural networks. Such works have focused on bounding the Lipschitz constant of fully connected or convolutional networks, composed of linear maps and pointwise non-linearities. In this paper, we investigate the Lipschitz constant of self-attention, a non-linear neural network module widely used in sequence modelling. We prove that the standard dot-product self-attention is not Lipschitz for unbounded input domain, and propose an alternative L2 self-attention that is Lipschitz. We derive an upper bound on the Lipschitz constant of L2 self-attention and provide empirical evidence for its asymptotic tightness. To demonstrate the practical relevance of our theoretical work, we formulate invertible self-attention and use it in a Transformer-based architecture for a character-level language modelling task.
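The sketch below contrasts standard dot-product self-attention with an L2-style variant in which the similarity is the negative squared L2 distance between query and key projections, with the query and key projections tied, as the abstract's description suggests. This is a minimal illustration assuming single-head attention and NumPy conventions; the exact scaling, normalisation, and output projection used in the paper may differ, and the function names here are illustrative only.

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_self_attention(X, W_Q, W_K, W_V):
    """Standard dot-product self-attention (shown in the paper to be
    non-Lipschitz on an unbounded input domain)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

def l2_self_attention(X, W_Q, W_V):
    """Sketch of an L2 self-attention variant: attention logits are negative
    squared L2 distances between projected inputs, with the query and key
    projections tied (an assumption based on the abstract; details follow
    the paper)."""
    Q = X @ W_Q                 # queries and keys share one projection
    d = Q.shape[-1]
    # pairwise squared distances ||q_i - q_j||^2
    sq_dists = ((Q[:, None, :] - Q[None, :, :]) ** 2).sum(-1)
    scores = -sq_dists / np.sqrt(d)
    return softmax(scores) @ (X @ W_V)

# toy usage
rng = np.random.default_rng(0)
N, D = 5, 8
X = rng.normal(size=(N, D))
W_Q, W_K, W_V = (rng.normal(size=(D, D)) for _ in range(3))
print(dot_product_self_attention(X, W_Q, W_K, W_V).shape)  # (5, 8)
print(l2_self_attention(X, W_Q, W_V).shape)                # (5, 8)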
