Paper Title
Controllable Image Captioning
Paper Authors
Paper Abstract
State-of-the-art image captioners can generate accurate sentences that describe images in a sequence-to-sequence manner, but without considering controllability or interpretability. This, however, keeps image captioning far from widespread use, since an image can be interpreted in countless ways depending on the purpose and the context at hand. Achieving controllability is especially important when the image captioner serves different users with different ways of interpreting images. In this paper, we introduce a novel image captioning framework that can generate diverse descriptions by capturing the co-dependence between Part-Of-Speech (POS) tags and semantics. Our model decouples the direct dependence between successive variables; in this way, it allows the decoder to exhaustively search through the latent Part-Of-Speech choices while keeping the decoding speed proportional to the size of the POS vocabulary. Given a control signal in the form of a sequence of Part-Of-Speech tags, we propose a method to generate captions through a Transformer network, which predicts words based on the input Part-Of-Speech tag sequence. Experiments on publicly available datasets show that our model significantly outperforms state-of-the-art methods at generating diverse image captions of high quality.
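As a minimal illustration of the POS-conditioned decoding the abstract describes, the sketch below fuses a POS-tag control signal with the word embeddings of a Transformer decoder that cross-attends over image features. It assumes PyTorch; the class name PosConditionedCaptioner and all hyperparameters are hypothetical and not taken from the paper.

import torch
import torch.nn as nn

class PosConditionedCaptioner(nn.Module):
    def __init__(self, word_vocab: int, pos_vocab: int, d_model: int = 512):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, d_model)
        self.pos_emb = nn.Embedding(pos_vocab, d_model)  # embeds the POS control signal
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.out = nn.Linear(d_model, word_vocab)

    def forward(self, words, pos_tags, image_feats):
        # Each position's word embedding is fused with the POS tag it must
        # realize; the decoder then cross-attends over image region features.
        x = self.word_emb(words) + self.pos_emb(pos_tags)
        t = words.size(1)
        causal = torch.triu(
            torch.full((t, t), float("-inf"), device=words.device), diagonal=1
        )
        h = self.decoder(x, memory=image_feats, tgt_mask=causal)
        return self.out(h)  # per-position logits over the word vocabulary

# Toy usage: region features would normally come from a visual backbone.
model = PosConditionedCaptioner(word_vocab=10000, pos_vocab=20)
words = torch.randint(0, 10000, (2, 12))      # partially generated captions
pos_tags = torch.randint(0, 20, (2, 12))      # control signal: one POS tag per slot
image_feats = torch.randn(2, 36, 512)         # e.g. 36 region features per image
logits = model(words, pos_tags, image_feats)  # shape (2, 12, 10000)

Conditioning on the POS tag at each position, rather than only on previously generated words, is one simple way to let the same image yield several differently structured captions from different control sequences.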