Paper Title
Edited Media Understanding Frames: Reasoning About the Intent and Implications of Visual Misinformation
Paper Authors
Paper Abstract
Multimodal disinformation, from 'deepfakes' to simple edits that deceive, is an important societal problem. Yet at the same time, the vast majority of media edits are harmless -- such as a filtered vacation photo. The difference between this example and harmful edits that spread disinformation is one of intent. Recognizing and describing this intent is a major challenge for today's AI systems. We present the task of Edited Media Understanding, requiring models to answer open-ended questions that capture the intent and implications of an image edit. We introduce a dataset for our task, EMU, with 48k question-answer pairs written in rich natural language. We evaluate a wide variety of vision-and-language models for our task, and introduce a new model, PELICAN, which builds upon recent progress in pretrained multimodal representations. Our model obtains promising results on our dataset, with humans rating its answers as accurate 40.35% of the time. At the same time, there is still much work to be done -- humans prefer human-annotated captions 93.56% of the time -- and we provide analysis that highlights areas for further progress.
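To make the task format concrete, the sketch below shows one plausible way an open-ended question-answer pair about an image edit could be represented. The field names and example content are illustrative assumptions for exposition only, not the actual EMU schema or data.

```python
# Hypothetical record for an Edited Media Understanding example.
# Field names are assumptions, not the EMU dataset's real schema.
from dataclasses import dataclass


@dataclass
class EditQAExample:
    source_image: str   # path to the original (unedited) image
    edited_image: str   # path to the edited image
    question: str       # open-ended question about the edit's intent/implications
    answer: str         # free-form natural-language answer

example = EditQAExample(
    source_image="images/original_001.jpg",
    edited_image="images/edited_001.jpg",
    question="What might the editor have intended by making this change?",
    answer=(
        "The edit places the subject in a different setting, which could "
        "mislead viewers about where the event took place."
    ),
)

if __name__ == "__main__":
    # A model for this task would take both images and the question as input
    # and produce an open-ended answer to be compared against references.
    print(example.question)
    print(example.answer)
```

Note that evaluation in the abstract is reported via human judgments (answer accuracy and pairwise preference against human-written answers), so any automated comparison of model output to the reference answer here would only be a rough proxy.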