Paper Title
DU-VLG: Unifying Vision-and-Language Generation via Dual Sequence-to-Sequence Pre-training
Paper Authors
Paper Abstract
Due to the limitations of the model structure and pre-training objectives, existing vision-and-language generation models cannot utilize pair-wise images and text through bi-directional generation. In this paper, we propose DU-VLG, a framework which unifies vision-and-language generation as sequence generation problems. DU-VLG is trained with novel dual pre-training tasks: multi-modal denoising autoencoder tasks and modality translation tasks. To bridge the gap between image understanding and generation, we further design a novel commitment loss. We compare pre-training objectives on image captioning and text-to-image generation datasets. Results show that DU-VLG yields better performance than variants trained with uni-directional generation objectives or the variant without the commitment loss. We also obtain higher scores compared to previous state-of-the-art systems on three vision-and-language generation tasks. In addition, human judges further confirm that our model generates real and relevant images as well as faithful and informative captions.
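To make the training setup described in the abstract more concrete, here is a minimal, hypothetical PyTorch-style sketch of how the modality-translation objectives and the commitment loss could be combined. The `model(src=..., tgt=...)` interface, the `codebook` tensor, and the `beta` weight are illustrative assumptions and not the authors' implementation; the multi-modal denoising autoencoder task mentioned in the abstract is omitted for brevity.

```python
# Hypothetical sketch (not the authors' code): dual sequence-to-sequence
# pre-training losses plus a VQ-VAE-style commitment term, assuming images
# are represented as discrete visual tokens drawn from a pretrained codebook.
import torch
import torch.nn.functional as F

def dual_pretraining_loss(model, image_tokens, text_tokens,
                          image_embeds, codebook, beta=0.25):
    # Modality translation: image -> caption and caption -> image, both cast
    # as sequence generation with cross-entropy over the target tokens.
    img2txt = model(src=image_tokens, tgt=text_tokens)   # logits: (B, T_txt, V_txt)
    txt2img = model(src=text_tokens, tgt=image_tokens)   # logits: (B, T_img, V_img)
    loss_i2t = F.cross_entropy(img2txt.flatten(0, 1), text_tokens.flatten())
    loss_t2i = F.cross_entropy(txt2img.flatten(0, 1), image_tokens.flatten())

    # Commitment loss (VQ-VAE style, assumed form): pull the encoder's
    # continuous image features toward their codebook entries so that image
    # understanding (continuous) and generation (discrete) share one space.
    quantized = codebook[image_tokens]                    # (B, T_img, D)
    loss_commit = F.mse_loss(image_embeds, quantized.detach())

    return loss_i2t + loss_t2i + beta * loss_commit
```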