Paper Title
Learning to Collocate Visual-Linguistic Neural Modules for Image Captioning
Paper Authors
Paper Abstract
Humans tend to decompose a sentence into different parts like \textsc{sth do sth at someplace} and then fill each part with certain content. Inspired by this, we follow the \textit{principle of modular design} to propose a novel image captioner: learning to Collocate Visual-Linguistic Neural Modules (CVLNM). Unlike the \re{widely used} neural module networks in VQA, where the language (\ie, question) is fully observable, \re{the task of collocating visual-linguistic modules is more challenging.} This is because the language is only partially observable, for which we need to dynamically collocate the modules during the process of image captioning. To sum up, we make the following technical contributions to design and train our CVLNM: 1) \textit{distinguishable module design} -- \re{four modules in the encoder}, including one linguistic module for function words and three visual modules for different content words (\ie, noun, adjective, and verb), and another linguistic one in the decoder for commonsense reasoning, 2) a self-attention based \textit{module controller} for robustifying the visual reasoning, and 3) a part-of-speech based \textit{syntax loss} imposed on the module controller for further regularizing the training of our CVLNM. Extensive experiments on the MS-COCO dataset show that our CVLNM is more effective, \eg, achieving a new state-of-the-art 129.5 CIDEr-D, and more robust, \eg, being less likely to overfit to dataset bias and suffering less when fewer training samples are available. Code is available at \url{https://github.com/GCYZSL/CVLMN}.
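To make the abstract's three contributions concrete, below is a minimal PyTorch sketch of how a self-attention based module controller might softly collocate one linguistic and three visual modules at each decoding step, and how a part-of-speech based syntax loss could supervise the controller's weights. All dimensions, module internals, and names (VisualModule, LinguisticModule, ModuleController, syntax_loss, etc.) are illustrative assumptions, not the authors' implementation; refer to the linked repository for the real code.

```python
# Minimal sketch of the ideas summarized in the abstract (all details assumed).
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualModule(nn.Module):
    """Attends over image region features to produce one content-word feature
    (a stand-in for the noun / adjective / verb modules)."""
    def __init__(self, d_model):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, query, regions):
        # query: (B, 1, D) current language state; regions: (B, R, D) region features
        ctx, _ = self.attn(query, regions, regions)
        return self.proj(ctx)                                  # (B, 1, D)


class LinguisticModule(nn.Module):
    """Ignores the image; produces a feature for function words from the
    language context only."""
    def __init__(self, d_model):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                 nn.Linear(d_model, d_model))

    def forward(self, query, regions=None):
        return self.mlp(query)                                 # (B, 1, D)


class ModuleController(nn.Module):
    """Self-attention based controller: predicts soft collocation weights
    over the modules at every decoding step."""
    def __init__(self, d_model, n_modules=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.to_weights = nn.Linear(d_model, n_modules)

    def forward(self, history):
        # history: (B, T, D) embeddings of the partially generated caption
        h, _ = self.self_attn(history, history, history)
        logits = self.to_weights(h[:, -1:, :])                 # latest step: (B, 1, M)
        return logits.squeeze(1)                               # (B, M)


def collocate(controller, modules, history, regions):
    """Mix the module outputs with the controller's softmax weights."""
    logits = controller(history)                               # (B, M)
    weights = F.softmax(logits, dim=-1)
    query = history[:, -1:, :]                                 # current decoding state
    outs = torch.cat([m(query, regions) for m in modules], dim=1)   # (B, M, D)
    fused = (weights.unsqueeze(-1) * outs).sum(dim=1)               # (B, D)
    return fused, logits


def syntax_loss(logits, pos_labels):
    """POS-based syntax loss: supervise the controller's weights with the
    part-of-speech class (e.g. 0=function, 1=noun, 2=adjective, 3=verb)
    of the ground-truth word at this step."""
    return F.cross_entropy(logits, pos_labels)


if __name__ == "__main__":
    B, T, R, D = 2, 5, 36, 512
    modules = nn.ModuleList([LinguisticModule(D), VisualModule(D),
                             VisualModule(D), VisualModule(D)])
    controller = ModuleController(D, n_modules=len(modules))
    history = torch.randn(B, T, D)                             # partial-caption embeddings
    regions = torch.randn(B, R, D)                             # image region features
    pos = torch.randint(0, 4, (B,))                            # POS class of the next word
    fused, logits = collocate(controller, modules, history, regions)
    loss = syntax_loss(logits, pos)
    print(fused.shape, loss.item())
```

In this sketch the syntax loss acts purely as a regularizer on the controller's collocation weights; it would be added to the usual captioning objective (cross-entropy or CIDEr-based reinforcement learning) rather than replace it.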