Paper Title
Edited Media Understanding Frames: Reasoning About the Intent and Implications of Visual Misinformation
Paper Authors
Paper Abstract
Multimodal disinformation, from 'deepfakes' to simple edits that deceive, is an important societal problem. Yet at the same time, the vast majority of media edits are harmless -- such as a filtered vacation photo. The difference between this example and harmful edits that spread disinformation is one of intent. Recognizing and describing this intent is a major challenge for today's AI systems. We present the task of Edited Media Understanding, requiring models to answer open-ended questions that capture the intent and implications of an image edit. We introduce a dataset for our task, EMU, with 48k question-answer pairs written in rich natural language. We evaluate a wide variety of vision-and-language models for our task, and introduce a new model, PELICAN, which builds upon recent progress in pretrained multimodal representations. Our model obtains promising results on our dataset, with humans rating its answers as accurate 40.35% of the time. At the same time, there is still much work to be done -- humans prefer human-annotated captions 93.56% of the time -- and we provide analysis that highlights areas for further progress.
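To make the task format concrete, the sketch below shows one plausible way an open-ended question-answer pair about an image edit could be represented. The field names and example content are illustrative assumptions for exposition only, not the actual EMU schema or data.

```python
# Hypothetical record for an Edited Media Understanding example.
# Field names are assumptions, not the EMU dataset's real schema.
from dataclasses import dataclass


@dataclass
class EditQAExample:
    source_image: str   # path to the original (unedited) image
    edited_image: str   # path to the edited image
    question: str       # open-ended question about the edit's intent/implications
    answer: str         # free-form natural-language answer

example = EditQAExample(
    source_image="images/original_001.jpg",
    edited_image="images/edited_001.jpg",
    question="What might the editor have intended by making this change?",
    answer=(
        "The edit places the subject in a different setting, which could "
        "mislead viewers about where the event took place."
    ),
)

if __name__ == "__main__":
    # A model for this task would take both images and the question as input
    # and produce an open-ended answer to be compared against references.
    print(example.question)
    print(example.answer)
```

Note that evaluation in the abstract is reported via human judgments (answer accuracy and pairwise preference against human-written answers), so any automated comparison of model output to the reference answer here would only be a rough proxy.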