VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency

Jan, 2021

VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency

Ruohan Gao, Kristen Grauman

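TL;DR: Proposes a method that separates speech using both facial appearance and voice characteristics. It performs audio-visual speech separation and enhancement on five benchmark datasets and generalizes well.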

Abstract

We introduce a new approach for audio-visual speech separation. Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers. Whereas existing methods focus on learning the alignment between the speaker's