Due to the rise in video content creation targeted towards children, there is
a need for robust content moderation schemes for video hosting platforms. A
video that is visually benign may include audio content that is inappropriate
for young children while being impossible to detect with a unimodal content
moderation system. Popular video hosting platforms for children such as YouTube
Kids still publish videos which contain audio content that is not conducive to
a child's healthy behavioral and physical development. A robust classification
of malicious videos requires audio representations in addition to video
features. However, recent content moderation approaches rarely employ
multimodal architectures that explicitly consider non-speech audio cues. To
address this, we present an efficient adaptation of CLIP (Contrastive
Language-Image Pre-training) that can leverage contextual audio cues for
enhanced content moderation. We incorporate 1) the audio modality and 2) prompt
learning, while keeping the backbone modules of each modality frozen. We
conduct our experiments on a multimodal version of the MOB (Malicious or
Benign) dataset in supervised and few-shot settings.
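
As a rough illustration of why an audio branch can change the decision for a visually benign video, the following minimal sketch fuses a video embedding and an audio embedding and scores the result against class prompt embeddings, CLIP-style. Everything here is an assumption made for illustration: the abstract does not specify the fusion rule, so simple averaging is used, and the toy vectors stand in for the outputs of frozen encoders and learned prompts. The names `fuse`, `classify`, and `class_prompts` are hypothetical.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def fuse(video_emb, audio_emb):
    # Hypothetical late fusion (an assumption, not the paper's method):
    # average the two unit vectors, then renormalize.
    v, a = l2_normalize(video_emb), l2_normalize(audio_emb)
    return l2_normalize([(x + y) / 2 for x, y in zip(v, a)])

def classify(video_emb, audio_emb, class_prompts, temperature=0.07):
    # Cosine similarity between the fused embedding and each class prompt
    # embedding, followed by a temperature-scaled softmax.
    fused = fuse(video_emb, audio_emb)
    logits = {}
    for label, emb in class_prompts.items():
        p = l2_normalize(emb)
        logits[label] = sum(x * y for x, y in zip(fused, p)) / temperature
    m = max(logits.values())
    exps = {k: math.exp(v - m) for k, v in logits.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}

# Toy embeddings (made up). In practice these would come from frozen
# encoders, and the prompt embeddings from learned prompt vectors.
class_prompts = {"benign": [1.0, 0.0, 0.0], "malicious": [0.0, 1.0, 0.0]}
video_emb = [0.9, 0.1, 0.0]  # visually close to "benign"
audio_emb = [0.1, 0.9, 0.2]  # audio leans towards "malicious"

probs = classify(video_emb, audio_emb, class_prompts)
# Video-only baseline: reuse the video embedding in place of the audio one.
video_only = classify(video_emb, video_emb, class_prompts)
```

In this toy setup, comparing `probs["malicious"]` with `video_only["malicious"]` shows the audio cue shifting probability mass towards the malicious class, which is the effect a unimodal system cannot capture.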

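In response to the growing volume of video content created for children, video hosting platforms need robust content moderation schemes. We present an efficient adaptation of CLIP that leverages contextual audio cues to enhance content moderation. It incorporates the audio modality and prompt learning while keeping the backbone modules of each modality frozen, and we run experiments on a multimodal version of the Malicious or Benign dataset.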

Audiovisual Fusion Enhances Multimodal Content Moderation of Children's Videos

Enhanced Multimodal Content Moderation of Children's Videos using Audiovisual Fusion

In this work, we tackle the challenge of enhancing the realism and
expressiveness in talking head video generation by focusing on the dynamic and
nuanced relationship between audio cues and facial movements. We identify the
limitations of traditional techniques that often fail to capture the full
spectrum of human expressions and the uniqueness of individual facial styles.
To address these issues, we propose EMO, a novel framework that utilizes a
direct audio-to-video synthesis approach, bypassing the need for intermediate
3D models or facial landmarks. Our method ensures seamless frame transitions
and consistent identity preservation throughout the video, resulting in highly
expressive and lifelike animations. Experimental results demonstrate that EMO is
able to produce not only convincing speaking videos but also singing videos in
various styles, significantly outperforming existing state-of-the-art
methodologies in terms of expressiveness and realism.

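In this work, we address the challenge of improving realism and expressiveness in talking head video generation by focusing on the dynamic and nuanced relationship between audio cues and facial movements. We identify the limitations of traditional techniques, which often fail to capture the full spectrum of human expressions and the uniqueness of individual facial styles. To address these issues, we propose EMO, a novel framework that uses a direct audio-to-video synthesis approach, bypassing the need for intermediate 3D models or facial landmarks. Our method ensures smooth frame transitions and consistent identity preservation throughout the video, producing highly expressive and lifelike animations. Experimental results show that EMO can produce not only convincing speaking videos but also singing videos in various styles, clearly outperforming existing state-of-the-art methods in expressiveness and realism.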

EMO: Bringing Expressive Portraits to Life - Generating Expressive Portrait Videos with an Audio2Video Diffusion Model under Weak Conditions

EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions

Gaze following estimates the gaze targets of a person in a scene by understanding
human behavior and scene information. Existing methods usually analyze scene
images for gaze following. However, alongside visual images, audio also
provides crucial cues for determining human behavior. This suggests that we can
further improve gaze following by considering audio cues. In this paper, we
explore gaze following tasks in conversational scenarios. We propose a novel
multi-modal gaze following framework based on our observation that "audiences
tend to focus on the speaker". We first leverage the correlation between audio
and lips to classify speakers and listeners in a scene. We then use the identity
information to enhance scene images and propose a gaze candidate estimation
network. The network estimates gaze candidates from the enhanced scene images,
and we use an MLP to match subjects with candidates as a classification task.
Existing gaze following datasets focus on visual images while ignoring audio. To evaluate
our method, we collect a conversational dataset, VideoGazeSpeech (VGS), which
is the first gaze following dataset including images and audio. Our method
significantly outperforms existing methods on the VGS dataset. The visualization
results also demonstrate the advantage of audio cues in gaze following tasks. Our
work will inspire more research in multi-modal gaze following estimation.
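
The matching step described above, where an MLP assigns each subject to one of several gaze candidates, can be sketched as a classification over candidates. This is only a toy illustration under stated assumptions: the two input features (whether the candidate is the speaker, and its distance from the subject), the hand-picked weights, and the names `mlp_score` and `match_subject` are all hypothetical. The network in the paper is learned and operates on enhanced scene images rather than on hand-built features.

```python
import math

def relu(x):
    return max(0.0, x)

def mlp_score(features, w1, b1, w2, b2):
    # One hidden layer with ReLU, followed by a linear output giving one score.
    hidden = [relu(sum(w * f for w, f in zip(row, features)) + b)
              for row, b in zip(w1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def match_subject(subject, candidates, params):
    # Score every (subject, candidate) pair, then treat the choice of
    # candidate as a classification problem via a softmax over the scores.
    scores = []
    for cand in candidates:
        dist = math.hypot(cand["pos"][0] - subject["pos"][0],
                          cand["pos"][1] - subject["pos"][1])
        features = [1.0 if cand["is_speaker"] else 0.0, dist]
        scores.append(mlp_score(features, *params))
    probs = softmax(scores)
    best = max(range(len(candidates)), key=lambda i: probs[i])
    return candidates[best]["name"], probs

# Hand-picked weights (an assumption): reward speaker candidates and
# mildly penalize distance. A real model would learn these from data.
W1 = [[2.0, 0.0], [0.0, 1.0]]
B1 = [0.0, 0.0]
W2 = [1.0, -0.5]
B2 = 0.0
PARAMS = (W1, B1, W2, B2)

listener = {"name": "listener", "pos": (0.2, 0.5)}
candidates = [
    {"name": "speaker_face", "pos": (0.8, 0.5), "is_speaker": True},
    {"name": "cup_on_table", "pos": (0.3, 0.6), "is_speaker": False},
]
target, probs = match_subject(listener, candidates, PARAMS)
```

With these weights the speaker flag outweighs proximity, so the listener is matched to the speaker's face even though another candidate is closer, mirroring the observation that audiences tend to focus on the speaker.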

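Using audio cues, this paper proposes a multi-modal gaze following framework for conversational scenarios. It exploits the correlation between audio and lips to enhance scene images and estimate gaze candidates, uses a multilayer perceptron to match subjects with candidates as a classification task, and is evaluated on a newly collected conversational dataset containing images and audio. The results show that our method has a significant advantage in gaze following tasks and encourage further research on multi-modal gaze following estimation.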