Active speaker detection (ASD) systems are important modules for analyzing
multi-talker conversations. They aim to detect which speakers or none are
talking in a visual scene at any given time. Existing research on ASD does not
agree on the definition of active speakers. We clarify the definition in this
work and require synchronization between the audio and visual speaking
activities. This clarification of definition is motivated by our extensive
experiments, through which we discover that existing ASD methods fail in
modeling the audio-visual synchronization and often classify unsynchronized
videos as active speaking. To address this problem, we propose a cross-modal
contrastive learning strategy and apply positional encoding in attention
modules for supervised ASD models to leverage the synchronization cue.
Experimental results suggest that our model can successfully detect
unsynchronized speaking as not speaking, addressing the limitation of current
models.

本文提出一种跨模态对比学习策略，并在注意力模块中应用位置编码来识别音频和视频之间的同步信号，解决现有 ASD 方法不能识别异步视频导致误报的问题。实验结果表明该方法成功检测到非同步说话，解决了当前模型的局限性。