Audio-visual speech separation methods aim to integrate different modalities to generate high-quality separated speech, thereby enhancing the performance of downstream tasks such as speech recognition. Most existing state-of-the-art (SOTA) models operate in the time domain. However, their overly simplistic approach to modeling acoustic features often necessitates larger and more computationally intensive models in order to achieve SOTA performance. In this paper, we present a novel time-frequency domain audio-visual speech separation method: Recurrent Time-Frequency Separation Network (RTFS-Net), which applies its algorithms on the complex time-frequency bins yielded by the Short-Time Fourier Transform. We model and capture the time and frequency dimensions of the audio independently using a multi-layered RNN along each dimension. Furthermore, we introduce a unique attention-based fusion technique for the efficient integration of audio and visual information, and a new mask separation approach that takes advantage of the intrinsic spectral nature of the acoustic features for a clearer separation. RTFS-Net outperforms the previous SOTA method using only 10% of the parameters and 18% of the MACs. This is the first time-frequency domain audio-visual speech separation method to outperform all contemporary time-domain counterparts.
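To make the idea of working on complex time-frequency bins and modeling each axis independently more concrete, the following is a minimal sketch and not the RTFS-Net implementation. It assumes a naive Hann-windowed STFT written with the standard library, and it uses a toy exponential recurrence as a stand-in for the multi-layered RNN described above. The names `stft`, `recurrent_scan`, and `model_axes`, along with the parameters `n_fft`, `hop`, and `decay`, are hypothetical and chosen only for illustration.

```python
import cmath
import math


def stft(signal, n_fft=8, hop=4):
    """Naive STFT: returns complex bins as a list of frames, shape [T][F]."""
    # Periodic Hann window (an assumed choice, not taken from the paper).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(n_fft)]
        # One-sided DFT of the windowed chunk: F = n_fft // 2 + 1 bins.
        frames.append([
            sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                for n in range(n_fft))
            for k in range(n_fft // 2 + 1)
        ])
    return frames


def recurrent_scan(seq, decay=0.5):
    """Toy recurrence h[i] = decay * h[i-1] + (1 - decay) * x[i].

    A placeholder for a learned RNN layer; it has no trainable weights.
    """
    h, out = 0.0, []
    for x in seq:
        h = decay * h + (1 - decay) * x
        out.append(h)
    return out


def model_axes(bins):
    """Run the recurrence along frequency and along time, independently."""
    n_time, n_freq = len(bins), len(bins[0])
    # Frequency axis: scan across the F bins inside each frame.
    freq_out = [recurrent_scan(frame) for frame in bins]
    # Time axis: scan across the T frames for each frequency bin.
    cols = [recurrent_scan([bins[t][k] for t in range(n_time)])
            for k in range(n_freq)]
    time_out = [[cols[k][t] for k in range(n_freq)] for t in range(n_time)]
    return freq_out, time_out


# A pure tone that falls exactly on bin 2 when n_fft = 8.
signal = [math.sin(2 * math.pi * 2 * n / 8) for n in range(32)]
bins = stft(signal)
freq_out, time_out = model_axes(bins)
```

Both outputs keep the `[T][F]` layout of the input bins, so the two scans could be stacked or combined afterwards. In a real model the toy recurrence would be replaced by learned recurrent layers.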

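This paper proposes a novel time-frequency domain audio-visual speech separation method, the Recurrent Time-Frequency Separation Network (RTFS-Net). It operates on the complex time-frequency bins produced by the Short-Time Fourier Transform and models the time and frequency dimensions of the audio independently. It also introduces an attention-based fusion technique to integrate audio and visual information efficiently, and exploits the intrinsic spectral nature of the acoustic features for a clearer separation. RTFS-Net outperforms the previous state-of-the-art method using only 10% of the parameters and 18% of the MACs. It is the first time-frequency domain audio-visual speech separation method to outperform all contemporary time-domain counterparts.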

RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation
