Paper Title

Transform and Tell: Entity-Aware News Image Captioning

Paper Authors

Alasdair Tran, Alexander Mathews, Lexing Xie

Paper Abstract

We propose an end-to-end model which generates captions for images embedded in news articles. News images present two key challenges: they rely on real-world knowledge, especially about named entities; and they typically have linguistically rich captions that include uncommon words. We address the first challenge by associating words in the caption with faces and objects in the image, via a multi-modal, multi-head attention mechanism. We tackle the second challenge with a state-of-the-art transformer language model that uses byte-pair-encoding to generate captions as a sequence of word parts. On the GoodNews dataset, our model outperforms the previous state of the art by a factor of four in CIDEr score (13 to 54). This performance gain comes from a unique combination of language models, word representation, image embeddings, face embeddings, object embeddings, and improvements in neural network design. We also introduce the NYTimes800k dataset which is 70% larger than GoodNews, has higher article quality, and includes the locations of images within articles as an additional contextual cue.
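The abstract notes that captions are generated as a sequence of word parts via byte-pair encoding (BPE), so rare words and named entities can be built from subword units rather than mapped to an out-of-vocabulary token. A minimal sketch of the idea, using a toy hand-written merge table (the actual merge rules learned by the paper's tokenizer are not shown here):

```python
# Minimal sketch of byte-pair encoding (BPE) tokenization: a word is split
# into characters, then adjacent pairs are greedily merged according to a
# learned priority table, yielding subword units ("word parts").
# The merge table below is a hypothetical example, not the paper's vocabulary.

def bpe_tokenize(word, merges):
    """Greedily apply merge rules (pair -> priority rank) to a word."""
    tokens = list(word)
    while True:
        # Record the leftmost position of each adjacent token pair.
        pairs = {}
        for i in range(len(tokens) - 1):
            pairs.setdefault((tokens[i], tokens[i + 1]), i)
        # Among pairs that have a merge rule, pick the lowest rank first.
        candidates = [(merges[p], i) for p, i in pairs.items() if p in merges]
        if not candidates:
            return tokens
        _, i = min(candidates)
        tokens = tokens[:i] + [tokens[i] + tokens[i + 1]] + tokens[i + 2:]

# Hypothetical merge table: pair -> rank (lower rank merges first).
merges = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}
print(bpe_tokenize("lower", merges))  # prints ['low', 'er']
```

In the paper's setting, the decoder predicts one such subword at a time, which is why linguistically rich captions with uncommon words (including named entities) remain representable.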
