Multi-modal fusion has proven to be an effective method to improve the
accuracy and robustness of speaker tracking, especially in complex scenarios.
However, how to combine the heterogeneous information and exploit the
complementarity of multi-modal signals remains a challenging issue. In this
paper, we propose a novel Multi-modal Perception Tracker (MPT) for speaker
tracking using both audio and visual modalities. Specifically, a novel acoustic
map based on spatial-temporal Global Coherence Field (stGCF) is first
constructed for heterogeneous signal fusion, which employs a camera model to
map audio cues to the localization space consistent with the visual cues. Then
a multi-modal perception attention network is introduced to derive the
perception weights that measure the reliability and effectiveness of
intermittent audio and video streams disturbed by noise. Moreover, a unique
cross-modal self-supervised learning method is presented to model the
confidence of audio and visual observations by leveraging the complementarity
and consistency between different modalities. Experimental results show that
the proposed MPT achieves 98.6% and 78.3% tracking accuracy on the standard and
occluded datasets, respectively, demonstrating its robustness under adverse
conditions and outperforming current state-of-the-art methods.
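
The camera-model step can be illustrated with a minimal sketch. It assumes a pinhole camera with a known intrinsic matrix `K` and candidate source positions already expressed in the camera frame; the abstract does not specify the camera model. The names `project_points` and `acoustic_map` are hypothetical, and the per-point `scores` stand in for stGCF coherence values, which are not computed here.

```python
import numpy as np


def project_points(points_3d, K):
    """Project Nx3 camera-frame points to Nx2 pixel coordinates (pinhole model)."""
    uvw = points_3d @ K.T            # each row becomes K @ p
    return uvw[:, :2] / uvw[:, 2:3]  # perspective division by depth


def acoustic_map(points_3d, scores, K, height, width):
    """Scatter non-negative per-point scores onto an image-sized map.

    Assumption: when several points fall on one pixel, the largest score wins.
    """
    amap = np.zeros((height, width))
    front = points_3d[:, 2] > 0      # drop points behind the camera
    uv = np.rint(project_points(points_3d[front], K)).astype(int)
    kept_scores = scores[front]
    inside = (
        (uv[:, 0] >= 0) & (uv[:, 0] < width)
        & (uv[:, 1] >= 0) & (uv[:, 1] < height)
    )
    # Row index is the v (vertical) coordinate, column index is u.
    np.maximum.at(amap, (uv[inside, 1], uv[inside, 0]), kept_scores[inside])
    return amap
```

Because the resulting map shares the pixel grid of the video frame, audio and visual cues can be compared location by location, which is the purpose the abstract gives for the camera model.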

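This paper proposes a Multi-modal Perception Tracker (MPT) for speaker tracking using audio and visual modalities. It uses an acoustic map based on the spatial-temporal Global Coherence Field (stGCF) for heterogeneous signal fusion, introduces a multi-modal perception attention network to derive perception weights that measure reliability and effectiveness, and uses a cross-modal self-supervised learning method to model the complementarity and consistency between modalities. Experimental results show that the proposed MPT achieves 98.6% and 78.3% tracking accuracy on the standard and occluded datasets, respectively, demonstrating robustness under adverse conditions that exceeds the current state of the art.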

Multi-Modal Perception Attention Network with Self-Supervised Learning for Audio-Visual Speaker Tracking

In this paper we address the problem of tracking multiple speakers via the
fusion of visual and auditory information. We propose to exploit the
complementary nature of these two modalities in order to accurately estimate
smooth trajectories of the tracked persons, to deal with the partial or total
absence of one of the modalities over short periods of time, and to estimate
the acoustic status -- either speaking or silent -- of each tracked person
over time. We cast the problem at hand as a generative
audio-visual fusion (or association) model formulated as a latent-variable
temporal graphical model. This may well be viewed as the problem of maximizing
the posterior joint distribution of a set of continuous and discrete latent
variables given the past and current observations, which is intractable. We
propose a variational inference model which amounts to approximating the joint
distribution with a factorized distribution. The solution takes the form of a
closed-form expectation-maximization procedure. We describe the inference
algorithm in detail, evaluate its performance, and compare it with several
baseline methods. These experiments show that the proposed audio-visual tracker
performs well in informal meetings involving a time-varying number of people.
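
The factorized approximation can be written compactly. The notation below is an assumption made for illustration, since the abstract introduces no symbols: $\mathbf{s}_t$ denotes the continuous latent variables, $\mathbf{z}_t$ the discrete ones, and $\mathbf{o}_{1:t}$ the observations up to time $t$. This is the generic mean-field form, not necessarily the exact factorization used in the paper.

```latex
\begin{align}
p(\mathbf{s}_t, \mathbf{z}_t \mid \mathbf{o}_{1:t})
  &\approx q(\mathbf{s}_t)\, q(\mathbf{z}_t), \\
\log q^{\star}(\mathbf{s}_t)
  &= \mathbb{E}_{q(\mathbf{z}_t)}\!\left[
       \log p(\mathbf{s}_t, \mathbf{z}_t \mid \mathbf{o}_{1:t})
     \right] + \text{const}, \\
\log q^{\star}(\mathbf{z}_t)
  &= \mathbb{E}_{q(\mathbf{s}_t)}\!\left[
       \log p(\mathbf{s}_t, \mathbf{z}_t \mid \mathbf{o}_{1:t})
     \right] + \text{const}.
\end{align}
```

Alternating the two updates reduces the Kullback-Leibler divergence between the factorized distribution and the true posterior one factor at a time.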

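This paper proposes a multi-speaker tracking system built on an audio-visual fusion framework. It uses variational inference to approximate the posterior joint distribution over continuous and discrete latent variables, yielding smooth trajectory estimates and speaking-status decisions for the tracked persons. Experimental results show that the method performs well in informal meetings.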

Variational Bayesian Inference for Audio-Visual Tracking of Multiple Speakers

In this paper, we present a system that associates faces with voices in a
video by fusing information from the audio and visual signals. The thesis
underlying our work is that an extremely simple approach to generating (weak)
speech clusters can be combined with visual signals to effectively associate
faces and voices by aggregating statistics across a video. This approach does
not need any training data specific to this task and leverages the natural
coherence of information in the audio and visual streams. It is particularly
applicable to tracking speakers in videos on the web where a priori information
about the environment (e.g., number of speakers, spatial signals for
beamforming) is not available. We performed experiments on a real-world dataset
using this analysis framework to determine the speaker in a video. Against a
ground-truth labeling determined by human rater consensus, our approach achieved
approximately 71% accuracy.
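
The idea of aggregating statistics across a video can be sketched as a simple co-occurrence count. This is only an illustration under stated assumptions: the abstract does not say which visual signal is used, so the sketch assumes each time window already carries a weak speech-cluster label and the identity of the face judged visually active (for example, by mouth motion). The function name `associate_clusters_to_faces` and the input layout are hypothetical.

```python
from collections import Counter, defaultdict


def associate_clusters_to_faces(windows):
    """Map each speech cluster to the face it co-occurs with most often.

    windows: iterable of (speech_cluster_id, active_face_id) pairs, one per
    time window; either element may be None when nothing was detected.
    """
    counts = defaultdict(Counter)
    for cluster, face in windows:
        if cluster is None or face is None:
            continue  # skip windows with no speech or no active face
        counts[cluster][face] += 1
    # Pick the most frequent face per cluster (first seen wins on ties).
    return {c: faces.most_common(1)[0][0] for c, faces in counts.items()}
```

Individual windows may be mislabeled, but counting over the whole video lets the dominant pairing emerge without task-specific training data, which matches the motivation given above.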

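This paper presents an audio-visual association system that fuses information from the audio and visual signals and associates faces with voices by aggregating statistics across a video. It needs no task-specific training data, exploits the natural coherence of information in the audio and visual streams, and is particularly suited to tracking speakers in web videos. Experiments on a real-world dataset show an accuracy of about 71%.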