Paper Title

Fine-grained Image Captioning with CLIP Reward

Paper Authors

Jaemin Cho, Seunghyun Yoon, Ajinkya Kale, Franck Dernoncourt, Trung Bui, Mohit Bansal

Paper Abstract

Modern image captioning models are usually trained with text similarity objectives. However, since reference captions in public datasets often describe only the most salient common objects, models trained with text similarity objectives tend to ignore the specific and detailed aspects of an image that distinguish it from others. Toward more descriptive and distinctive caption generation, we propose using CLIP, a multimodal encoder trained on a huge number of image-text pairs from the web, to calculate multimodal similarity and use it as a reward function. We also propose a simple finetuning strategy for the CLIP text encoder that improves grammar without requiring extra text annotations. This completely eliminates the need for reference captions during reward computation. To comprehensively evaluate descriptive captions, we introduce FineCapEval, a new dataset for caption evaluation with fine-grained criteria: overall, background, object, and relations. In our experiments on text-to-image retrieval and FineCapEval, the proposed CLIP-guided model generates more distinctive captions than the CIDEr-optimized model. We also show that our unsupervised grammar finetuning of the CLIP text encoder alleviates the degeneration problem of the naive CLIP reward. Lastly, we present a human analysis in which annotators strongly prefer the CLIP reward to the CIDEr and MLE objectives across various criteria. Code and Data: https://github.com/j-min/CLIP-Caption-Reward
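The core of the approach above is using CLIP's image-text similarity as a reward signal instead of a reference-based metric. A minimal sketch, assuming image and caption embeddings have already been produced by CLIP's encoders; the `w = 2.5` rescaling and zero-clipping follow the common CLIPScore convention and may differ from the paper's exact reward:

```python
import numpy as np

def clip_reward(image_emb: np.ndarray, caption_emb: np.ndarray, w: float = 2.5) -> float:
    """Reference-free reward for a generated caption.

    Cosine similarity between L2-normalized CLIP embeddings,
    clipped at zero and rescaled (CLIPScore-style convention).
    No reference captions are needed, only the image itself.
    """
    img = image_emb / np.linalg.norm(image_emb)
    cap = caption_emb / np.linalg.norm(caption_emb)
    return w * max(float(img @ cap), 0.0)
```

In the RL setup described in the abstract, this scalar would replace the CIDEr score as the per-caption reward during self-critical sequence training; the embeddings themselves would come from CLIP's frozen image encoder and the (grammar-finetuned) text encoder.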
