Paper Title
Normalized Attention Without Probability Cage
Paper Authors
Paper Abstract
Attention architectures are widely used; they recently gained renewed popularity with Transformers yielding a streak of state-of-the-art results. Yet, the geometrical implications of softmax attention remain largely unexplored. In this work we highlight the limitations of constraining attention weights to the probability simplex and, consequently, the outputs to the convex hull of the value vectors. We show that, at initialization, Transformers exhibit a sequence-length-dependent bias towards token isolation, and we contrast Transformers with simple max- and sum-pooling, two strong baselines that are rarely reported. We propose replacing the softmax in self-attention with normalization, yielding a hyperparameter- and data-bias-robust, generally applicable architecture. We support our insights with empirical results from more than 25,000 trained models. All results and implementations are made available.
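To make the central idea concrete, below is a minimal sketch contrasting standard softmax attention with a normalization-based variant. The specific choice of normalization here (L2 over each score vector) and the function names are illustrative assumptions made for this sketch; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F


def softmax_attention(q, k, v):
    # Standard scaled dot-product attention: the weights lie on the
    # probability simplex, so each output is confined to the convex
    # hull of the value vectors.
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    weights = F.softmax(scores, dim=-1)
    return weights @ v


def normalized_attention(q, k, v, eps=1e-6):
    # Hypothetical variant: replace the softmax with a plain L2
    # normalization of the score vector. Weights may be negative and
    # need not sum to one, lifting the convex-hull constraint.
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    weights = scores / (scores.norm(dim=-1, keepdim=True) + eps)
    return weights @ v


# Toy usage: batch of 2 sequences, 5 tokens, model dimension 8.
q = torch.randn(2, 5, 8)
k = torch.randn(2, 5, 8)
v = torch.randn(2, 5, 8)
print(softmax_attention(q, k, v).shape)     # torch.Size([2, 5, 8])
print(normalized_attention(q, k, v).shape)  # torch.Size([2, 5, 8])
```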