On Distinctive Image Captioning via Comparing and Reweighting

Paper title

On Distinctive Image Captioning via Comparing and Reweighting

Authors

Jiuniu Wang, Wenjia Xu, Qingzhong Wang, Antoni B. Chan

Abstract

Recent image captioning models achieve impressive results on popular metrics, namely BLEU, CIDEr, and SPICE. However, these metrics consider only the overlap between generated captions and human annotations, so focusing on them can lead to captions built from common words and phrases that lack distinctiveness, i.e., many similar images receive the same caption. In this paper, we aim to improve the distinctiveness of image captions by comparing and reweighting with a set of similar images. First, we propose a distinctiveness metric, between-set CIDEr (CIDErBtw), to evaluate the distinctiveness of a caption with respect to those of similar images. Our metric reveals that the human annotations of each image in the MSCOCO dataset are not equivalent in distinctiveness; however, previous works normally treat the human annotations equally during training, which could be a reason for generating less distinctive captions. In contrast, we reweight each ground-truth caption according to its distinctiveness during training. We further integrate a long-tailed weight strategy to highlight rare words that contain more information, and we sample captions from the similar image set as negative examples to encourage the generated sentence to be unique. Finally, extensive experiments show that our proposed approach significantly improves both distinctiveness (as measured by CIDErBtw and retrieval metrics) and accuracy (e.g., as measured by CIDEr) for a wide variety of image captioning baselines. These results are further confirmed through a user study.
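
The reweighting idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract gives no formulas, so the similarity below is a plain n-gram cosine used as a stand-in for CIDEr, and the function names (`between_set_score`, `caption_weights`), the exponential weighting rule, and the `temperature` parameter are all hypothetical choices made only for illustration.

```python
import math
from collections import Counter


def ngram_vector(caption, max_n=2):
    """Count all n-grams (n = 1..max_n) of a whitespace-tokenized caption."""
    tokens = caption.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def between_set_score(caption, similar_captions):
    """Mean similarity of a caption to the captions of similar images.

    Stand-in for CIDErBtw (assumption: real CIDEr uses TF-IDF weighted
    n-grams). A lower score means the caption is more distinctive.
    """
    if not similar_captions:
        return 0.0
    vec = ngram_vector(caption)
    total = sum(cosine(vec, ngram_vector(c)) for c in similar_captions)
    return total / len(similar_captions)


def caption_weights(ground_truths, similar_captions, temperature=0.5):
    """Hypothetical weighting rule: more distinctive captions get larger weights.

    Weights are scaled to average 1 so the overall loss magnitude is unchanged.
    """
    scores = [between_set_score(c, similar_captions) for c in ground_truths]
    raw = [math.exp(-s / temperature) for s in scores]
    scale = len(raw) / sum(raw)
    return [r * scale for r in raw]


if __name__ == "__main__":
    ground_truths = [
        "a man riding a wave on a surfboard",
        "a surfer in a red wetsuit carves a huge green wave",
    ]
    similar = [
        "a man riding a wave on a surfboard",
        "a person riding a wave on a surfboard",
    ]
    for caption, weight in zip(ground_truths, caption_weights(ground_truths, similar)):
        print(f"{weight:.3f}  {caption}")
```

In a training loop, weights of this kind would multiply the per-caption loss, so that a generic ground-truth caption shared with many similar images contributes less than a caption specific to the target image.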
