Paper Title
Say As You Wish: Fine-grained Control of Image Caption Generation with Abstract Scene Graphs
Paper Authors
Paper Abstract
Humans are able to describe image contents with coarse to fine details as they wish. However, most image captioning models are intention-agnostic and cannot actively generate diverse descriptions according to different user intentions. In this work, we propose the Abstract Scene Graph (ASG) structure to represent user intention at a fine-grained level and to control what the generated description should say and how detailed it should be. An ASG is a directed graph consisting of three types of \textbf{abstract nodes} (object, attribute, relationship) grounded in the image, without any concrete semantic labels; it is therefore easy to obtain either manually or automatically. From the ASG, we propose a novel ASG2Caption model, which is able to recognise user intentions and semantics in the graph, and therefore generates the desired captions according to the graph structure. Our model achieves better controllability conditioned on ASGs than carefully designed baselines on both the VisualGenome and MSCOCO datasets. It also significantly improves caption diversity by automatically sampling diverse ASGs as control signals.
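To make the ASG structure concrete, the following is a minimal sketch of how such a graph could be represented: a directed graph whose nodes carry only an abstract type (object, attribute, or relationship) and an optional image grounding, with no semantic labels. All class and field names here (`ASGNode`, `AbstractSceneGraph`, the bounding-box grounding) are hypothetical illustrations, not the authors' implementation.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional, Tuple, Dict, List

class NodeType(Enum):
    # The three abstract node types named in the abstract.
    OBJECT = "object"
    ATTRIBUTE = "attribute"
    RELATIONSHIP = "relationship"

@dataclass
class ASGNode:
    node_id: int
    node_type: NodeType
    # Abstract nodes are grounded in the image (here, a hypothetical
    # bounding box) but carry no concrete semantic label such as "man".
    bbox: Optional[Tuple[int, int, int, int]] = None  # (x, y, w, h)

@dataclass
class AbstractSceneGraph:
    nodes: Dict[int, ASGNode] = field(default_factory=dict)
    edges: List[Tuple[int, int]] = field(default_factory=list)  # directed

    def add_node(self, node: ASGNode) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, src: int, dst: int) -> None:
        self.edges.append((src, dst))

# Intention "describe two objects and the relation between them":
# two object nodes linked through a relationship node.
asg = AbstractSceneGraph()
asg.add_node(ASGNode(0, NodeType.OBJECT, bbox=(10, 20, 50, 80)))
asg.add_node(ASGNode(1, NodeType.RELATIONSHIP))
asg.add_node(ASGNode(2, NodeType.OBJECT, bbox=(40, 30, 90, 60)))
asg.add_edge(0, 1)  # subject object -> relationship
asg.add_edge(1, 2)  # relationship -> target object
```

Because nodes are label-free, the same structure can request a short caption (few nodes) or a detailed one (more attribute and relationship nodes), which is what gives the user fine-grained control.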