Multimodal manipulations (also known as audio-visual deepfakes) make it
difficult for unimodal deepfake detectors to detect forgeries in multimedia
content. To avoid the spread of false propaganda and fake news, timely
detection is crucial. The damage to either modality (i.e., visual or audio) can
only be discovered through multi-modal models that can exploit both pieces of
information simultaneously. Previous methods mainly adopt uni-modal video
forensics and use supervised pre-training for forgery detection. This study
proposes a new method based on a multi-modal self-supervised-learning (SSL)
feature extractor to exploit inconsistency between audio and visual modalities
for multi-modal video forgery detection. We use the transformer-based SSL
pre-trained Audio-Visual HuBERT (AV-HuBERT) model as a visual and acoustic
feature extractor and a multi-scale temporal convolutional neural network to
capture the temporal correlation between the audio and visual modalities. Since
AV-HuBERT only extracts visual features from the lip region, we also adopt
another transformer-based video model to exploit facial features and capture
spatial and temporal artifacts caused during the deepfake generation process.
Experimental results show that our model outperforms all existing models and
achieves new state-of-the-art performance on the FakeAVCeleb and DeepfakeTIMIT
datasets.
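
As a rough illustration of the idea of scoring audio-visual inconsistency at several temporal scales, the following is a minimal sketch and not the authors' model. It assumes time-aligned per-frame audio and visual feature vectors are already available (in the paper these would come from AV-HuBERT), and it replaces the learned multi-scale temporal convolutional network with fixed uniform kernels. The function names and the kernel sizes `(1, 3, 5)` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def frame_similarity(audio_feats, visual_feats):
    """Per-frame cosine similarity between time-aligned audio and visual features."""
    if len(audio_feats) != len(visual_feats):
        raise ValueError("audio and visual sequences must be time-aligned")
    return [cosine(a, v) for a, v in zip(audio_feats, visual_feats)]


def moving_average(signal, kernel_size):
    """Uniform 1-D temporal filter; windows are truncated at the sequence edges."""
    half = kernel_size // 2
    out = []
    for t in range(len(signal)):
        window = signal[max(0, t - half): t + half + 1]
        out.append(sum(window) / len(window))
    return out


def multi_scale_inconsistency(audio_feats, visual_feats, kernel_sizes=(1, 3, 5)):
    """One inconsistency score per temporal scale (assumed, hand-picked scales).

    Each score is 1 minus the lowest smoothed audio-visual similarity, so a
    short mismatch dominates at small scales and is diluted at larger ones.
    """
    sim = frame_similarity(audio_feats, visual_feats)
    return [1.0 - min(moving_average(sim, k)) for k in kernel_sizes]
```

In a learned model the fixed kernels would be trainable convolution filters and the scores would feed a classifier; the sketch only shows why looking at several temporal scales helps separate brief mismatches from sustained ones.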

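A multi-modal video forgery detection method built on a multi-modal self-supervised learning (SSL) feature extractor: it extracts visual and acoustic features to exploit the inconsistency between the audio and visual modalities, and captures the temporal correlation between the two modalities with a multi-scale temporal convolutional neural network. Experimental results show that the model achieves better performance on the FakeAVCeleb and DeepfakeTIMIT datasets.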

AV-Lip-Sync+: Leveraging AV-HuBERT to Exploit Multimodal Inconsistency for Video Deepfake Detection

Most deepfake detection methods focus on detecting spatial and/or
spatio-temporal changes in facial attributes. This is because available
benchmark datasets contain mostly visual-only modifications. However, a
sophisticated deepfake may include small segments of audio or audio-visual
manipulations that can completely change the meaning of the content. To
address this gap, we propose and benchmark a new dataset, Localized Audio
Visual DeepFake (LAV-DF), consisting of strategic content-driven audio, visual
and audio-visual manipulations. The proposed baseline method, Boundary Aware
Temporal Forgery Detection (BA-TFD), is a 3D Convolutional Neural Network-based
architecture that efficiently captures multimodal manipulations. We further
improve the baseline method, yielding BA-TFD+, by replacing the backbone with a
Multiscale Vision Transformer and guiding the training process with contrastive,
frame classification, boundary matching and multimodal boundary matching loss
functions. The quantitative analysis demonstrates the superiority of BA-TFD+
on temporal forgery localization and deepfake detection tasks using several
benchmark datasets including our newly proposed dataset. The dataset, models
and code are available at this https URL
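
To make the temporal forgery localization task concrete, here is a minimal sketch of how per-frame fake probabilities could be turned into time segments and compared with a ground-truth segment by temporal IoU. This is a simple thresholding stand-in, not the boundary matching mechanism used by BA-TFD or BA-TFD+. The function names, the threshold of 0.5 and the default of 25 frames per second are assumptions made for illustration.

```python
def frames_to_segments(fake_probs, threshold=0.5, fps=25.0):
    """Group consecutive frames with probability >= threshold into (start, end) seconds."""
    segments = []
    start = None
    for i, p in enumerate(fake_probs):
        if p >= threshold and start is None:
            start = i  # a fake run begins at this frame
        elif p < threshold and start is not None:
            segments.append((start / fps, i / fps))  # the run ended before frame i
            start = None
    if start is not None:
        # The run extends to the end of the clip.
        segments.append((start / fps, len(fake_probs) / fps))
    return segments


def temporal_iou(seg_a, seg_b):
    """Intersection over union of two (start, end) time segments."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0
```

For example, `frames_to_segments([0.1, 0.9, 0.8, 0.2, 0.7], fps=1.0)` groups frames 1 to 2 and frame 4 into two segments, and `temporal_iou` can then score each predicted segment against an annotated one.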

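This paper proposes a deepfake detection method and introduces a dataset of strategic content-driven audio, visual, and combined audio-visual manipulations covering multiple modalities. Quantitative analysis demonstrates the superiority of the BA-TFD+ algorithm for deepfake detection.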