Object-based audio production requires the positional metadata to be defined
for each point-source object, including the key elements in the foreground of
the sound scene. In many media production use cases, both cameras and
microphones are employed to make recordings, and the human voice is often a key
element. In this research, we detect and locate the active speaker in the
video, facilitating the automatic extraction of the positional metadata of the
talker relative to the camera's reference frame. With the integration of the
visual modality, this study expands upon our previous investigation focused
solely on audio-based active speaker detection and localization. Our
experiments compare conventional audio-visual approaches for active speaker
detection that leverage monaural audio, our previous audio-only method that
leverages multichannel recordings from a microphone array, and a novel
audio-visual approach integrating vision and multichannel audio. We found that
the two modalities complement each other. Multichannel audio,
overcoming the problem of visual occlusions, provides a double-digit reduction
in detection error compared to audio-visual methods with single-channel audio.
The combination of multichannel audio and vision further enhances spatial
accuracy, leading to a four-percentage-point increase in F1 score on the Tragic
Talkers dataset. Future investigations will assess the robustness of the model
in noisy and highly reverberant environments, as well as tackle the problem of
off-screen speakers.
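
As a rough illustration of what "positional metadata relative to the camera's reference frame" can mean in practice, the sketch below maps a pixel position (for example, the centre of a detected face) to azimuth and elevation angles. It is not the method of the paper: it assumes an ideal pinhole camera with square pixels, no lens distortion, and a known horizontal field of view. The function name `pixel_to_angles`, its parameters, and the sign convention (azimuth positive to the right, elevation positive upward) are hypothetical choices made for this example.

```python
import math


def pixel_to_angles(u, v, width, height, hfov_deg):
    """Map pixel (u, v) to (azimuth, elevation) in degrees.

    Assumes an ideal pinhole camera with square pixels and no distortion.
    """
    # Focal length in pixels, derived from the horizontal field of view.
    focal = (width / 2) / math.tan(math.radians(hfov_deg) / 2)
    # Offsets from the optical axis; image rows grow downward, so flip v.
    x = u - width / 2
    y = height / 2 - v
    azimuth = math.degrees(math.atan2(x, focal))
    elevation = math.degrees(math.atan2(y, math.hypot(x, focal)))
    return azimuth, elevation


# Hypothetical usage: face centre in a 1920x1080 frame, 90-degree horizontal FOV.
az, el = pixel_to_angles(1200, 400, 1920, 1080, 90.0)
print(f"azimuth={az:.1f} deg, elevation={el:.1f} deg")
```

Object-based audio formats may use a different sign convention for azimuth, so the output would need to be adapted to the target renderer.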

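Using multichannel audio and the visual modality, this study compares against conventional audio-visual approaches to active speaker detection that use monaural audio, and achieves significant improvements in positional metadata extraction and spatial accuracy. Future research will assess the robustness of the model in noisy and highly reverberant environments and address the problem of off-screen speakers.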

Audio-Visual Talker Localization in Video for Spatial Sound Reproduction

This paper presents an audio-visual approach for voice separation that
produces state-of-the-art results at low latency in two scenarios: speech and
singing voice. The model is based on a two-stage network. Motion cues are
obtained with a lightweight graph convolutional network that processes face
landmarks. Then, both audio and motion features are fed to an audio-visual
transformer, which produces a fairly good estimate of the isolated target
source. In a second stage, the predominant voice is enhanced with an audio-only
network. We present several ablation studies and comparisons with
state-of-the-art methods. Finally, we explore the transferability of models
trained for speech separation to the task of singing voice separation. The
demos, code, and weights are available at this https URL
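
To make the idea of a graph convolution over face landmarks more concrete, here is a minimal NumPy sketch of a single graph-convolution layer. It is not the network described in the abstract: it assumes the common symmetric-normalised formulation, ReLU(Â X W) with Â = D^(-1/2) (A + I) D^(-1/2), and uses a toy four-landmark chain graph with random weights. The names `normalized_adjacency` and `graph_conv`, the edge list, and all shapes are hypothetical.

```python
import numpy as np


def normalized_adjacency(edges, num_nodes):
    """Build D^(-1/2) (A + I) D^(-1/2) for an undirected graph."""
    adj = np.eye(num_nodes)  # self-loops
    for i, j in edges:
        adj[i, j] = adj[j, i] = 1.0
    d_inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))
    return adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def graph_conv(x, a_hat, w):
    """One layer: mix each node with its neighbours, project, apply ReLU."""
    return np.maximum(a_hat @ x @ w, 0.0)


# Toy example: 4 landmarks connected in a chain, each with (x, y) coordinates.
edges = [(0, 1), (1, 2), (2, 3)]
a_hat = normalized_adjacency(edges, num_nodes=4)

rng = np.random.default_rng(0)
landmarks = rng.standard_normal((4, 2))  # one video frame of landmark positions
weights = rng.standard_normal((2, 3))    # random stand-in for learned weights

features = graph_conv(landmarks, a_hat, weights)
print(features.shape)
```

In a real model the weights would be learned, and the layer would be applied per frame (or across frames) so that changes in landmark positions over time act as motion cues.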

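This paper proposes an audio-visual voice separation approach that achieves state-of-the-art results at low latency in two scenarios: speech and singing voice. The model is based on a two-stage network: a lightweight graph convolutional network extracts motion cues from face landmarks, and the visual and audio features are then fed to an audio-visual transformer, which produces a fairly good estimate of the isolated target source. In the second stage, an audio-only network enhances the predominant voice. We present several ablation studies and comparisons with state-of-the-art methods. Finally, we explore the transferability of models trained for speech separation to the task of singing voice separation.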