Paper Title

Do You Really Mean That? Content Driven Audio-Visual Deepfake Dataset and Multimodal Method for Temporal Forgery Localization

Paper Authors

Zhixi Cai, Kalin Stefanov, Abhinav Dhall, Munawar Hayat

Paper Abstract

Due to its high societal impact, deepfake detection is getting active attention in the computer vision community. Most deepfake detection methods rely on identity, facial attributes, and adversarial perturbation-based spatio-temporal modifications at the whole video or random locations while keeping the meaning of the content intact. However, a sophisticated deepfake may contain only a small segment of video/audio manipulation, through which the meaning of the content can be, for example, completely inverted from a sentiment perspective. We introduce a content-driven audio-visual deepfake dataset, termed Localized Audio Visual DeepFake (LAV-DF), explicitly designed for the task of learning temporal forgery localization. Specifically, the content-driven audio-visual manipulations are performed strategically to change the sentiment polarity of the whole video. Our baseline method for benchmarking the proposed dataset is a 3DCNN model, termed Boundary Aware Temporal Forgery Detection (BA-TFD), which is guided via contrastive, boundary matching, and frame classification loss functions. Our extensive quantitative and qualitative analysis demonstrates the proposed method's strong performance for temporal forgery localization and deepfake detection tasks.
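The abstract states that BA-TFD is trained with three objectives: a contrastive loss (aligning audio and visual streams), a boundary matching loss (scoring candidate forgery-segment boundaries), and a frame classification loss (per-frame real/fake prediction). A minimal sketch of how such a multi-task objective might be combined is shown below; the loss names come from the abstract, but the concrete formulas, function names, and weights here are illustrative assumptions, not the paper's exact definitions.

```python
import math

# Hedged sketch of a BA-TFD-style combined objective.
# All formulas and weights below are assumptions for illustration only.

def contrastive_loss(dist, same_pair, margin=1.0):
    """Pairwise contrastive loss on an audio-visual embedding distance."""
    if same_pair:                        # pull matching audio/video together
        return dist ** 2
    return max(0.0, margin - dist) ** 2  # push mismatched pairs apart

def bce(p, y, eps=1e-7):
    """Binary cross-entropy for a single prediction p against label y."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def frame_classification_loss(frame_probs, frame_labels):
    """Mean BCE over per-frame real(0)/fake(1) predictions."""
    return sum(bce(p, y) for p, y in zip(frame_probs, frame_labels)) / len(frame_probs)

def boundary_matching_loss(boundary_probs, boundary_labels):
    """Mean BCE over confidences for candidate forgery-segment boundaries."""
    return sum(bce(p, y) for p, y in zip(boundary_probs, boundary_labels)) / len(boundary_probs)

def total_loss(dist, same_pair, frame_probs, frame_labels,
               boundary_probs, boundary_labels,
               w_c=1.0, w_f=1.0, w_b=1.0):
    """Weighted sum of the three objectives (weights are hypothetical)."""
    return (w_c * contrastive_loss(dist, same_pair)
            + w_f * frame_classification_loss(frame_probs, frame_labels)
            + w_b * boundary_matching_loss(boundary_probs, boundary_labels))
```

In this reading, the contrastive term supervises cross-modal representation learning, while the boundary and frame terms directly supervise the temporal localization output; the actual model in the paper applies these at the level of 3DCNN feature maps rather than scalar toy inputs.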
