Paper Title
DU-VLG: Unifying Vision-and-Language Generation via Dual Sequence-to-Sequence Pre-training
Paper Authors
Paper Abstract
Due to the limitations of the model structure and pre-training objectives, existing vision-and-language generation models cannot utilize pair-wise images and text through bi-directional generation. In this paper, we propose DU-VLG, a framework which unifies vision-and-language generation as sequence generation problems. DU-VLG is trained with novel dual pre-training tasks: multi-modal denoising autoencoder tasks and modality translation tasks. To bridge the gap between image understanding and generation, we further design a novel commitment loss. We compare pre-training objectives on image captioning and text-to-image generation datasets. Results show that DU-VLG yields better performance than variants trained with uni-directional generation objectives or the variant without the commitment loss. We also obtain higher scores compared to previous state-of-the-art systems on three vision-and-language generation tasks. In addition, human judges further confirm that our model generates real and relevant images as well as faithful and informative captions.
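To make the training setup described in the abstract more concrete, here is a minimal, hypothetical PyTorch-style sketch of how the modality-translation objectives and the commitment loss could be combined. The `model(src=..., tgt=...)` interface, the `codebook` tensor, and the `beta` weight are illustrative assumptions and not the authors' implementation; the multi-modal denoising autoencoder task mentioned in the abstract is omitted for brevity.

```python
# Hypothetical sketch (not the authors' code): dual sequence-to-sequence
# pre-training losses plus a VQ-VAE-style commitment term, assuming images
# are represented as discrete visual tokens drawn from a pretrained codebook.
import torch
import torch.nn.functional as F

def dual_pretraining_loss(model, image_tokens, text_tokens,
                          image_embeds, codebook, beta=0.25):
    # Modality translation: image -> caption and caption -> image, both cast
    # as sequence generation with cross-entropy over the target tokens.
    img2txt = model(src=image_tokens, tgt=text_tokens)   # logits: (B, T_txt, V_txt)
    txt2img = model(src=text_tokens, tgt=image_tokens)   # logits: (B, T_img, V_img)
    loss_i2t = F.cross_entropy(img2txt.flatten(0, 1), text_tokens.flatten())
    loss_t2i = F.cross_entropy(txt2img.flatten(0, 1), image_tokens.flatten())

    # Commitment loss (VQ-VAE style, assumed form): pull the encoder's
    # continuous image features toward their codebook entries so that image
    # understanding (continuous) and generation (discrete) share one space.
    quantized = codebook[image_tokens]                    # (B, T_img, D)
    loss_commit = F.mse_loss(image_embeds, quantized.detach())

    return loss_i2t + loss_t2i + beta * loss_commit
```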