Paper Title

Normalized and Geometry-Aware Self-Attention Network for Image Captioning

Authors

Longteng Guo, Jing Liu, Xinxin Zhu, Peng Yao, Shichen Lu, Hanqing Lu

Abstract

Self-attention (SA) networks have shown profound value in image captioning. In this paper, we improve SA in two ways to boost image captioning performance. First, we propose Normalized Self-Attention (NSA), a reparameterization of SA that brings the benefits of normalization inside SA. While normalization was previously applied only outside SA, we introduce a novel normalization method and demonstrate that it is both possible and beneficial to perform it on the hidden activations inside SA. Second, to compensate for a major limitation of the Transformer, namely its inability to model the geometric structure of the input objects, we propose a class of Geometry-aware Self-Attention (GSA) that extends SA to explicitly and efficiently account for the relative geometric relations between objects in the image. To construct our image captioning model, we combine the two modules and apply them to the vanilla self-attention network. We extensively evaluate our proposals on the MS-COCO image captioning dataset and achieve superior results compared with state-of-the-art approaches. Further experiments on three challenging tasks, i.e., video captioning, machine translation, and visual question answering, demonstrate the generality of our methods.
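The abstract describes the two modules only at a high level; the exact formulations are given in the paper itself. The following minimal PyTorch sketch illustrates one plausible reading: NSA as an instance-style normalization applied to hidden activations inside the attention module, and GSA as a learned bias on the attention logits computed from pairwise bounding-box geometry. All function names, the choice of normalizing the query activations, and the log-ratio geometry encoding (the one commonly used for pairwise box relations) are assumptions for illustration, not the paper's exact method.

```python
import torch
import torch.nn.functional as F

def normalized_self_attention(x, w_q, w_k, w_v, eps=1e-6):
    # x: (N, d) features of N detected objects.
    q = x @ w_q
    # Assumed NSA step: normalize the query activations *inside* SA
    # (instance-style, across the object dimension), rather than applying
    # LayerNorm outside the module as in the vanilla Transformer.
    q = (q - q.mean(dim=0, keepdim=True)) / (q.std(dim=0, keepdim=True) + eps)
    k, v = x @ w_k, x @ w_v
    logits = (q @ k.t()) / (k.shape[-1] ** 0.5)
    return F.softmax(logits, dim=-1) @ v

def relative_geometry(boxes, eps=1e-3):
    # boxes: (N, 4) as (x_center, y_center, w, h).
    # Standard log-ratio encoding of pairwise box geometry.
    x, y, w, h = boxes.unbind(-1)
    dx = torch.log(torch.abs(x[:, None] - x[None, :]) / w[:, None] + eps)
    dy = torch.log(torch.abs(y[:, None] - y[None, :]) / h[:, None] + eps)
    dw = torch.log(w[None, :] / w[:, None])
    dh = torch.log(h[None, :] / h[:, None])
    return torch.stack([dx, dy, dw, dh], dim=-1)  # (N, N, 4)

def geometry_aware_logits(content_logits, boxes, w_g):
    # Assumed GSA step: project the pairwise geometry features to a scalar
    # bias per object pair and add it to the content-based attention logits.
    # w_g: (4,) learned weights; the paper defines several GSA variants,
    # this fixed-bias form is only the simplest.
    bias = relative_geometry(boxes) @ w_g  # (N, N)
    return content_logits + bias

# Hypothetical usage on random features and boxes:
N, d = 5, 16
x = torch.randn(N, d)
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
boxes = torch.rand(N, 4) + 0.1  # keep widths/heights positive
out = normalized_self_attention(x, w_q, w_k, w_v)   # (N, d)
logits = torch.randn(N, N)
biased = geometry_aware_logits(logits, boxes, torch.randn(4))
```

In this sketch the geometry bias is query- and key-independent; the paper's GSA variants instead condition the geometric term on the content features, which is what makes the extension "explicit and efficient" rather than a fixed positional prior.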
