Paper Title
Vision Transformer Based Model for Describing a Set of Images as a Story
Paper Authors
Paper Abstract
Visual storytelling is the process of forming a multi-sentence story from a set of images. Appropriately including the visual variation and contextual information captured in the input images is one of the most challenging aspects of visual storytelling. Consequently, stories developed from a set of images often lack cohesiveness, relevance, and semantic relationships. In this paper, we propose a novel Vision Transformer based model for describing a set of images as a story. The proposed method extracts the distinct features of the input images using a Vision Transformer (ViT). First, each input image is divided into 16x16 patches, which are flattened and mapped to a linear projection of flattened patches. This transformation from a single image to multiple image patches captures the visual variation across the input images. These features serve as input to a Bidirectional LSTM, which is part of the sequence encoder and captures the past and future context of all image patches. An attention mechanism is then applied to increase the discriminatory capacity of the data fed into the language model, i.e., a Mogrifier-LSTM. The performance of our proposed model is evaluated on the Visual Story-Telling dataset (VIST), and the results show that our model outperforms the current state-of-the-art models.
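To make the pipeline described in the abstract concrete, below is a minimal PyTorch sketch of the overall flow: ViT patch features are fed to a bidirectional LSTM sequence encoder, an attention step summarizes the encoder states, and a decoder LSTM generates story tokens. This is an illustrative assumption of the architecture, not the authors' code: the class name `StorySketch`, all dimensions, the attention design, and the use of a plain `nn.LSTM` in place of the Mogrifier-LSTM are placeholders.

```python
# Hypothetical sketch of the described pipeline (not the authors' implementation).
# ViT patch features -> Bi-LSTM sequence encoder -> attention -> decoder LSTM.
import torch
import torch.nn as nn

class StorySketch(nn.Module):
    def __init__(self, patch_dim=768, enc_dim=512, vocab_size=10000, embed_dim=256):
        super().__init__()
        # Bidirectional LSTM over the sequence of ViT patch/image features
        self.seq_encoder = nn.LSTM(patch_dim, enc_dim, batch_first=True, bidirectional=True)
        # Attention over the encoder states
        self.attn = nn.MultiheadAttention(embed_dim=2 * enc_dim, num_heads=4, batch_first=True)
        # Plain LSTM decoder standing in for the Mogrifier-LSTM language model
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim + 2 * enc_dim, 2 * enc_dim, batch_first=True)
        self.out = nn.Linear(2 * enc_dim, vocab_size)

    def forward(self, vit_features, tokens):
        # vit_features: (batch, num_patches, patch_dim) produced by a Vision Transformer
        enc, _ = self.seq_encoder(vit_features)           # (B, N, 2*enc_dim)
        # Mean encoder state as a single attention query (illustrative choice)
        query = enc.mean(dim=1, keepdim=True)             # (B, 1, 2*enc_dim)
        context, _ = self.attn(query, enc, enc)           # (B, 1, 2*enc_dim)
        emb = self.embed(tokens)                          # (B, T, embed_dim)
        ctx = context.expand(-1, emb.size(1), -1)         # broadcast context to each step
        dec, _ = self.decoder(torch.cat([emb, ctx], dim=-1))
        return self.out(dec)                              # (B, T, vocab_size) token logits

# Usage with random tensors standing in for ViT outputs and story tokens
model = StorySketch()
feats = torch.randn(2, 5 * 196, 768)       # e.g. 5 images x 196 patches of dim 768 (assumed)
tokens = torch.randint(0, 10000, (2, 20))  # partial story token ids
logits = model(feats, tokens)
print(logits.shape)  # torch.Size([2, 20, 10000])
```

The sketch only shows the data flow; the paper's actual contribution lies in how the ViT features, bidirectional encoding, attention, and Mogrifier-LSTM are combined and trained on VIST.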