In video-based emotion recognition (ER), it is important to effectively
leverage the complementary relationship between audio (A) and visual (V)
modalities, while retaining the intra-modal characteristics of individual
modalities. In this paper, a recursive joint attention model is proposed along
with long short-term memory (LSTM) modules for the fusion of vocal and facial
expressions in regression-based ER. Specifically, we investigate the
possibility of exploiting the complementary nature of A and V modalities using
a joint cross-attention model applied recursively, with LSTMs capturing the
temporal dependencies within each modality as well as across the joint A-V
feature representations. By integrating LSTMs with recursive joint
cross-attention, our model can efficiently leverage both intra- and inter-modal
relationships for the fusion of A and V modalities. The results of extensive
experiments performed on the challenging Affwild2 and Fatigue (private)
datasets indicate that the proposed A-V fusion model can significantly
outperform state-of-the-art methods.
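
The idea of recursive joint attention can be illustrated with a small sketch.
This is not the authors' implementation: it assumes a simplified, weight-free
form in which attention over time steps is computed from the concatenated A-V
features and applied to each modality, and the attended features are fed back
in for a fixed number of iterations. The LSTM modules and all learned
projections are omitted, and every function name below is hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def joint_attention_weights(audio, visual):
    """T x T attention weights computed from the concatenated A-V features.

    audio is T x d_a and visual is T x d_v (one row per time step).
    Assumption: scaled dot-product scores with no learned weights.
    """
    joint = [a + v for a, v in zip(audio, visual)]  # T x (d_a + d_v)
    scale = math.sqrt(len(joint[0]))
    scores = matmul(joint, [list(c) for c in zip(*joint)])
    return softmax_rows([[s / scale for s in row] for row in scores])

def attend(features, weights):
    """Average each feature row with its attention-weighted counterpart."""
    attended = matmul(weights, features)
    return [[0.5 * (f + g) for f, g in zip(fr, ar)]
            for fr, ar in zip(features, attended)]

def recursive_joint_attention(audio, visual, iterations=2):
    """Re-apply joint attention, feeding attended features back in."""
    for _ in range(iterations):
        weights = joint_attention_weights(audio, visual)
        audio, visual = attend(audio, weights), attend(visual, weights)
    return audio, visual
```

Because both modalities share weights derived from the joint representation,
each one is refined using information from the other while keeping its own
feature dimensionality.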

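This paper proposes a recursive joint attention model, combined with long short-term memory modules, to fuse vocal and facial expressions for regression-based emotion recognition; the results show that the model outperforms existing methods.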

Recursive Joint Attention for Audio-Visual Fusion in Regression-Based Emotion Recognition

While accurate lip synchronization has been achieved for arbitrary-subject
audio-driven talking face generation, the problem of how to efficiently drive
the head pose remains. Previous methods rely on pre-estimated structural
information such as landmarks and 3D parameters, aiming to generate
personalized rhythmic movements. However, the inaccuracy of such estimated
information under extreme conditions would lead to degradation problems. In
this paper, we propose a clean yet effective framework to generate
pose-controllable talking faces. We operate on raw face images, using only a
single photo as an identity reference. The key is to modularize audio-visual
representations by devising an implicit low-dimensional pose code. In essence,
both speech content and head pose information lie in a joint non-identity
embedding space. While speech content information can be defined by learning
the intrinsic synchronization between audio-visual modalities, we identify that
a pose code will be complementarily learned in a modulated convolution-based
reconstruction framework.
Extensive experiments show that our method generates accurately lip-synced
talking faces whose poses are controllable by other videos. Moreover, our model
has multiple advanced capabilities including extreme view robustness and
talking face frontalization. Code, models, and demo videos are available at
this https URL
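
As a rough illustration of what a modulated convolution does, the sketch below
scales convolution weights by a per-input-channel style vector and then
renormalizes each output filter. It assumes StyleGAN2-style modulation and
demodulation on a 1x1 kernel, which the abstract does not specify; in the
framework described above the style vector would presumably be derived from
the identity, speech content, and pose features, but here it is simply passed
in. All names are hypothetical.

```python
import math

def modulate_weights(weights, style, eps=1e-8):
    """Modulate and demodulate 1x1 convolution weights.

    weights: [out_channels][in_channels]; style: one scale per input channel.
    Assumption: StyleGAN2-style scaling followed by per-filter L2 normalization.
    """
    modulated = [[w * s for w, s in zip(row, style)] for row in weights]
    demodulated = []
    for row in modulated:
        norm = math.sqrt(sum(w * w for w in row) + eps)
        demodulated.append([w / norm for w in row])
    return demodulated

def conv1x1(pixel, weights):
    """Apply 1x1 convolution weights to one pixel's channel vector."""
    return [sum(w * x for w, x in zip(row, pixel)) for row in weights]
```

After demodulation each output filter has approximately unit norm, so the
style changes the relative contribution of input channels rather than the
overall activation scale.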

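This paper proposes a clean yet effective framework for generating pose-controllable talking faces. It operates on raw face images with an implicit low-dimensional pose code, places speech content and head pose information in a joint non-identity embedding space, and uses a modulated convolution-based reconstruction framework to generate accurately lip-synced talking faces. The model is robust to extreme views and offers further capabilities such as talking face frontalization.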

Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation

We present a learning-based method for distinguishing real multimedia content
from deepfakes. To maximize information for learning, we extract and
analyze the similarity between the audio and visual modalities from within
the same video. Additionally, we extract and compare affective cues
corresponding to perceived emotion from the two modalities within a video to
infer whether the input video is "real" or "fake". We propose a deep learning
network, inspired by the Siamese network architecture and the triplet loss. To
validate our model, we report the AUC metric on two large-scale deepfake
detection datasets, DeepFake-TIMIT (DF-TIMIT) and DFDC. We compare our approach
with several SOTA deepfake detection methods and report per-video AUC of 84.4%
on DFDC and 96.6% on DF-TIMIT. To the best of
our knowledge, ours is the first approach that simultaneously exploits audio
and video modalities and also perceived emotions from the two modalities for
deepfake detection.
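
The two ingredients named above, a triplet loss and the AUC metric, can be
sketched in a few lines. This is a generic illustration rather than the
authors' network: the Euclidean distance, the margin value, and the
convention that label 1 marks the positive class are all assumptions, and
the function names are hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss that pulls the positive closer to the anchor than the negative.

    The loss is zero once the negative is at least `margin` farther away
    from the anchor than the positive is.
    """
    return max(0.0, euclidean(anchor, positive)
               - euclidean(anchor, negative) + margin)

def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive (label 1) scores
    higher than a randomly chosen negative (label 0); ties count as half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For per-video evaluation, `scores` would hold one score per video, for
example a distance between that video's audio and visual embeddings.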

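This paper proposes a learning-based method for distinguishing real multimedia content from deepfakes. It extracts and analyzes the similarity between the audio and visual modalities within the same video, and extracts and compares affective cues from the two modalities, to infer whether the input video is "real" or "fake". The proposed deep learning network achieves per-video AUC of 84.4% on DFDC and 96.6% on DeepFake-TIMIT, and is, to the authors' knowledge, the first approach to simultaneously exploit the audio and visual modalities together with the perceived emotions from both for deepfake detection.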