Paper Title

Transform and Tell: Entity-Aware News Image Captioning

Paper Authors

Alasdair Tran, Alexander Mathews, Lexing Xie

Paper Abstract

We propose an end-to-end model which generates captions for images embedded in news articles. News images present two key challenges: they rely on real-world knowledge, especially about named entities; and they typically have linguistically rich captions that include uncommon words. We address the first challenge by associating words in the caption with faces and objects in the image, via a multi-modal, multi-head attention mechanism. We tackle the second challenge with a state-of-the-art transformer language model that uses byte-pair-encoding to generate captions as a sequence of word parts. On the GoodNews dataset, our model outperforms the previous state of the art by a factor of four in CIDEr score (13 to 54). This performance gain comes from a unique combination of language models, word representation, image embeddings, face embeddings, object embeddings, and improvements in neural network design. We also introduce the NYTimes800k dataset which is 70% larger than GoodNews, has higher article quality, and includes the locations of images within articles as an additional contextual cue.
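The abstract notes that captions are generated as a sequence of word parts via byte-pair encoding (BPE), so rare words and named entities can be built from subword units rather than mapped to an out-of-vocabulary token. A minimal sketch of the idea, using a toy hand-written merge table (the actual merge rules learned by the paper's tokenizer are not shown here):

```python
# Minimal sketch of byte-pair encoding (BPE) tokenization: a word is split
# into characters, then adjacent pairs are greedily merged according to a
# learned priority table, yielding subword units ("word parts").
# The merge table below is a hypothetical example, not the paper's vocabulary.

def bpe_tokenize(word, merges):
    """Greedily apply merge rules (pair -> priority rank) to a word."""
    tokens = list(word)
    while True:
        # Record the leftmost position of each adjacent token pair.
        pairs = {}
        for i in range(len(tokens) - 1):
            pairs.setdefault((tokens[i], tokens[i + 1]), i)
        # Among pairs that have a merge rule, pick the lowest rank first.
        candidates = [(merges[p], i) for p, i in pairs.items() if p in merges]
        if not candidates:
            return tokens
        _, i = min(candidates)
        tokens = tokens[:i] + [tokens[i] + tokens[i + 1]] + tokens[i + 2:]

# Hypothetical merge table: pair -> rank (lower rank merges first).
merges = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}
print(bpe_tokenize("lower", merges))  # prints ['low', 'er']
```

In the paper's setting, the decoder predicts one such subword at a time, which is why linguistically rich captions with uncommon words (including named entities) remain representable.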
