Paper Title

MINTIME: Multi-Identity Size-Invariant Video Deepfake Detection

Authors

Davide Alessandro Coccomini, Giorgos Kordopatis-Zilos, Giuseppe Amato, Roberto Caldelli, Fabrizio Falchi, Symeon Papadopoulos, Claudio Gennaro

Abstract

In this paper, we introduce MINTIME, a video deepfake detection approach that captures spatial and temporal anomalies and handles instances of multiple people in the same video and variations in face sizes. Previous approaches disregard such information either by using simple a-posteriori aggregation schemes, i.e., average or max operation, or using only one identity for the inference, i.e., the largest one. On the contrary, the proposed approach builds on a Spatio-Temporal TimeSformer combined with a Convolutional Neural Network backbone to capture spatio-temporal anomalies from the face sequences of multiple identities depicted in a video. This is achieved through an Identity-aware Attention mechanism that attends to each face sequence independently based on a masking operation and facilitates video-level aggregation. In addition, two novel embeddings are employed: (i) the Temporal Coherent Positional Embedding that encodes each face sequence's temporal information and (ii) the Size Embedding that encodes the size of the faces as a ratio to the video frame size. These extensions allow our system to adapt particularly well in the wild by learning how to aggregate information of multiple identities, which is usually disregarded by other methods in the literature. It achieves state-of-the-art results on the ForgeryNet dataset with an improvement of up to 14% AUC in videos containing multiple people and demonstrates ample generalization capabilities in cross-forgery and cross-dataset settings. The code is publicly available at https://github.com/davide-coccomini/MINTIME-Multi-Identity-size-iNvariant-TIMEsformer-for-Video-Deepfake-Detection.
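To make the two mechanisms in the abstract concrete (the Identity-aware Attention that restricts each face sequence to attend only to itself via masking, and the Size Embedding that encodes face size as a ratio to the frame size), here is a minimal NumPy sketch. The function names, the sinusoidal encoding of the size ratio, and all dimensions are illustrative assumptions, not the paper's actual implementation; see the linked repository for the authors' code.

```python
import numpy as np

def identity_attention_mask(identity_ids):
    """Boolean mask where True allows attention: each token (face crop)
    attends only to tokens belonging to the same identity's face sequence."""
    ids = np.asarray(identity_ids)
    return ids[:, None] == ids[None, :]

def masked_softmax_attention(q, k, v, mask):
    """Scaled dot-product attention; disallowed pairs are set to -inf
    before the softmax so they receive zero weight."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def size_embedding(face_box, frame_size, dim=8):
    """Encode the face area as a ratio of the frame area, expanded into a
    small sinusoidal vector (a hypothetical encoding, for illustration)."""
    (w, h), (W, H) = face_box, frame_size
    ratio = (w * h) / (W * H)
    freqs = 2.0 ** np.arange(dim // 2)
    return np.concatenate([np.sin(ratio * freqs), np.cos(ratio * freqs)])
```

With identities `[0, 0, 1, 1]`, the mask blocks attention between the two face sequences, so each output row mixes values only from its own identity; the size embedding can then be added to each token so the model knows how large the face was relative to the frame.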
