Paper Title
XGPT: Cross-modal Generative Pre-Training for Image Captioning
Paper Authors
Paper Abstract
While many BERT-based cross-modal pre-trained models produce excellent results on downstream understanding tasks like image-text retrieval and VQA, they cannot be applied to generation tasks directly. In this paper, we propose XGPT, a new method of Cross-modal Generative Pre-Training for Image Captioning that is designed to pre-train text-to-image caption generators through three novel generation tasks, including Image-conditioned Masked Language Modeling (IMLM), Image-conditioned Denoising Autoencoding (IDA), and Text-conditioned Image Feature Generation (TIFG). As a result, the pre-trained XGPT can be fine-tuned without any task-specific architecture modifications to create state-of-the-art models for image captioning. Experiments show that XGPT obtains new state-of-the-art results on the benchmark datasets, including COCO Captions and Flickr30k Captions. We also use XGPT to generate new image captions as data augmentation for the image retrieval task and achieve significant improvement on all recall metrics.
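To make the three pre-training objectives concrete, below is a minimal PyTorch sketch of how IMLM, IDA, and TIFG could be wired into a single encoder-decoder and summed into one loss. All module names, dimensions, the shared forward pass, and the pooled-feature version of TIFG are illustrative assumptions for this sketch, not the paper's actual architecture or loss definitions.

```python
import torch
import torch.nn as nn

# Toy dimensions for illustration only; the paper does not prescribe these.
VOCAB, D_MODEL, N_REGIONS, D_REGION = 1000, 256, 36, 2048

class ToyXGPT(nn.Module):
    """Sketch of the three XGPT-style pre-training objectives on a
    generic Transformer encoder-decoder (not the authors' model)."""
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB, D_MODEL)
        self.img_proj = nn.Linear(D_REGION, D_MODEL)   # project region features
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True), 2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(D_MODEL, nhead=4, batch_first=True), 2)
        self.lm_head = nn.Linear(D_MODEL, VOCAB)       # caption token prediction
        self.feat_head = nn.Linear(D_MODEL, D_REGION)  # image-feature regression

    def losses(self, regions, tokens, corrupted, mask):
        img = self.img_proj(regions)                             # (B, R, D)
        # Encode the image together with the corrupted caption;
        # both text objectives share this one forward pass (a simplification).
        mem = self.encoder(torch.cat([img, self.tok_emb(corrupted)], dim=1))
        logits = self.lm_head(self.decoder(self.tok_emb(tokens), mem))
        # IMLM: predict the masked caption tokens, conditioned on the image.
        imlm = nn.functional.cross_entropy(logits[mask], tokens[mask])
        # IDA: reconstruct the full caption from its corrupted version + image.
        ida = nn.functional.cross_entropy(logits.flatten(0, 1), tokens.flatten())
        # TIFG: regress a pooled image feature from the text alone
        # (a pooled stand-in for per-region feature generation).
        txt = self.encoder(self.tok_emb(tokens))
        tifg = nn.functional.mse_loss(self.feat_head(txt.mean(dim=1)),
                                      regions.mean(dim=1))
        return imlm + ida + tifg

# Usage on random toy data.
B, T = 2, 8
regions = torch.randn(B, N_REGIONS, D_REGION)        # detector region features
tokens = torch.randint(0, VOCAB, (B, T))             # gold caption tokens
mask = torch.rand(B, T) < 0.25
mask[:, 0] = True                                    # ensure a masked position
corrupted = tokens.masked_fill(mask, 0)              # 0 = hypothetical [MASK] id
loss = ToyXGPT().losses(regions, tokens, corrupted, mask)
```

In this toy setup, summing the three terms with equal weights is an arbitrary choice; how the objectives are scheduled or weighted during pre-training is a detail of the paper, not of this sketch.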