Paper Title

X-Linear Attention Networks for Image Captioning

Paper Authors

Yingwei Pan, Ting Yao, Yehao Li, Tao Mei

Paper Abstract

Recent progress on fine-grained visual recognition and visual question answering has featured Bilinear Pooling, which effectively models the 2$^{nd}$ order interactions across multi-modal inputs. Nevertheless, there has not been evidence in support of building such interactions concurrently with attention mechanism for image captioning. In this paper, we introduce a unified attention block -- X-Linear attention block, that fully employs bilinear pooling to selectively capitalize on visual information or perform multi-modal reasoning. Technically, X-Linear attention block simultaneously exploits both the spatial and channel-wise bilinear attention distributions to capture the 2$^{nd}$ order interactions between the input single-modal or multi-modal features. Higher and even infinity order feature interactions are readily modeled through stacking multiple X-Linear attention blocks and equipping the block with Exponential Linear Unit (ELU) in a parameter-free fashion, respectively. Furthermore, we present X-Linear Attention Networks (dubbed as X-LAN) that novelly integrates X-Linear attention block(s) into image encoder and sentence decoder of image captioning model to leverage higher order intra- and inter-modal interactions. The experiments on COCO benchmark demonstrate that our X-LAN obtains to-date the best published CIDEr performance of 132.0% on COCO Karpathy test split. When further endowing Transformer with X-Linear attention blocks, CIDEr is boosted up to 132.8%. Source code is available at \url{https://github.com/Panda-Peter/image-captioning}.
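For a concrete picture of the block the abstract describes, below is a minimal PyTorch sketch of a single X-Linear attention block written only from that description: low-rank bilinear pooling of a query against keys and values, a spatial attention distribution over regions, and a channel-wise attention gate, with ELU used to encourage higher-order interactions. The layer sizes, module names, and exact placement of activations are assumptions for illustration, not the authors' implementation (see the linked repository for that).

```python
# Hedged sketch of an X-Linear attention block (shapes and layer names are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class XLinearAttention(nn.Module):
    def __init__(self, dim: int, mid_dim: int = 512):
        super().__init__()
        # Low-rank bilinear pooling: separate embeddings for the query, keys, and values.
        self.q_k = nn.Linear(dim, mid_dim)
        self.k = nn.Linear(dim, mid_dim)
        self.q_v = nn.Linear(dim, mid_dim)
        self.v = nn.Linear(dim, mid_dim)
        self.embed = nn.Linear(mid_dim, mid_dim)
        self.spatial = nn.Linear(mid_dim, 1)        # per-region attention logit
        self.channel = nn.Linear(mid_dim, mid_dim)  # per-channel attention gate

    def forward(self, query, keys, values):
        # query: (B, D); keys, values: (B, N, D) over N image regions.
        q = query.unsqueeze(1)                                        # (B, 1, D)
        # 2nd-order (bilinear) key/value maps via element-wise product;
        # ELU(x) + 1 keeps activations positive, which the paper exploits to
        # reach higher/infinity-order interactions when blocks are stacked.
        b_k = (F.elu(self.k(keys)) + 1) * (F.elu(self.q_k(q)) + 1)    # (B, N, M)
        b_v = (F.elu(self.v(values)) + 1) * (F.elu(self.q_v(q)) + 1)  # (B, N, M)
        b = F.relu(self.embed(b_k))                                   # (B, N, M)
        # Spatial attention distribution over the N regions.
        beta_s = F.softmax(self.spatial(b), dim=1)                    # (B, N, 1)
        # Channel-wise attention from the pooled bilinear map.
        beta_c = torch.sigmoid(self.channel(b.mean(dim=1)))           # (B, M)
        # Attended value: spatially weighted sum, gated per channel.
        return beta_c * (beta_s * b_v).sum(dim=1)                     # (B, M)
```

Stacking several such blocks, with each output fed back in as the next query, is how the abstract's higher-order intra- and inter-modal interactions would be realized in both the image encoder and the sentence decoder.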
