Paper Title
Attention that does not Explain Away
Paper Authors
Paper Abstract
Models based on the Transformer architecture have achieved better accuracy than those based on competing architectures for a large set of tasks. A unique feature of the Transformer is its universal application of a self-attention mechanism, which allows for free information flow at arbitrary distances. Following a probabilistic view of the attention via the Gaussian mixture model, we find empirical evidence that the Transformer attention tends to "explain away" certain input neurons. To compensate for this, we propose a doubly-normalized attention scheme that is simple to implement and provides theoretical guarantees for avoiding the "explaining away" effect without introducing significant computational or memory cost. Empirically, we show that the new attention scheme results in improved performance on several well-known benchmarks.
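To make the contrast concrete, below is a minimal numpy sketch of standard softmax attention next to one plausible reading of a doubly-normalized scheme. The abstract does not give the exact formulation, so the second function is an assumption: scores are first normalized over the query axis (so every key hands out a unit of attention mass and no key can be entirely "explained away"), then renormalized over the key axis so each query's weights still sum to one. Function names and the two-step order are illustrative, not the paper's definitive algorithm.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def standard_attention(Q, K, V):
    # Standard scaled dot-product attention: normalize over keys only.
    # A key that loses the softmax competition for every query can end up
    # with near-zero total attention mass ("explained away").
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (num_queries, num_keys)
    weights = softmax(scores, axis=-1)        # rows sum to 1
    return weights @ V

def doubly_normalized_attention(Q, K, V):
    # Hypothetical sketch of a doubly-normalized scheme (assumption):
    # 1) softmax over the QUERY axis, so each key's column sums to 1 and
    #    every key is guaranteed a unit of outgoing attention mass;
    # 2) renormalize over the KEY axis, so each query's row again sums to 1.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = softmax(scores, axis=0)               # columns sum to 1
    w = w / w.sum(axis=-1, keepdims=True)     # rows sum to 1 again
    return w @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 8))
out_std = standard_attention(Q, K, V)
out_dn = doubly_normalized_attention(Q, K, V)
```

Note that the second step preserves the property that each query's weights form a distribution, so the scheme is a drop-in replacement with the same output shape and no extra parameters, consistent with the abstract's claim of negligible computational and memory cost.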