Recent deep learning approaches have achieved impressive performance on speech enhancement and separation tasks. However, these approaches have not been investigated for separating mixtures of arbitrary sounds of different types, a task we refer to as universal sound separation, and it is unknown whether performance on speech tasks carries over to non-speech tasks. To study this question, we develop a universal dataset of mixtures containing arbitrary sounds, and use it to investigate the space of mask-based separation architectures, varying both the overall network architecture and the framewise analysis-synthesis basis for signal transformations. These network architectures include convolutional long short-term memory networks and time-dilated convolution stacks inspired by the recent success of time-domain enhancement networks like ConvTasNet. For the latter architecture, we also propose novel modifications that further improve separation performance. In terms of the framewise analysis-synthesis basis, we explore using either a short-time Fourier transform (STFT) or a learnable basis, as used in ConvTasNet, and for both of these bases, we examine the effect of window size. In particular, for STFTs, we find that longer windows (25-50 ms) work best for speech/non-speech separation, while shorter windows (2.5 ms) work best for arbitrary sounds. For learnable bases, shorter windows (2.5 ms) work best on all tasks. Surprisingly, for universal sound separation, STFTs outperform learnable bases. Our best methods produce an improvement in scale-invariant signal-to-distortion ratio of over 13 dB for speech/non-speech separation and close to 10 dB for universal sound separation.
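The results above are reported as improvements in scale-invariant signal-to-distortion ratio (SI-SDR). The abstract does not spell out the formula, so the following is a minimal sketch that assumes the commonly used definition: project the estimate onto the reference, treat the residual as distortion, and take the energy ratio in dB. The function names `si_sdr` and `si_sdr_improvement`, the `eps` guard, and the toy sine-wave signals are illustrative assumptions, not taken from the paper.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB (assumed common definition, not the paper's code)."""
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")
    # Optimal scaling of the reference that best explains the estimate.
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    scale = dot / (ref_energy + eps)
    target = [scale * r for r in reference]
    # Everything in the estimate not explained by the scaled reference.
    residual = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(x * x for x in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))


def si_sdr_improvement(estimate, mixture, reference):
    """Gain in SI-SDR of the separated estimate over the unprocessed mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)


# Toy signals (hypothetical): a target tone plus an interfering tone.
n = 1000
reference = [math.sin(2 * math.pi * 5 * i / n) for i in range(n)]
interference = [math.sin(2 * math.pi * 13 * i / n) for i in range(n)]
mixture = [r + v for r, v in zip(reference, interference)]
# Pretend a separator attenuated the interference by a factor of 10.
estimate = [r + 0.1 * v for r, v in zip(reference, interference)]

improvement_db = si_sdr_improvement(estimate, mixture, reference)
rescaled_db = si_sdr([2.0 * e for e in estimate], reference)
```

Because the measure is scale invariant, rescaling the estimate leaves the score unchanged, so a separator cannot improve its score simply by changing output gain.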

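This study takes mask-based deep learning models, which have performed well on speech enhancement and separation, and investigates them for separating mixtures of arbitrary sound types, a task the authors call universal sound separation. The authors compare different analysis-synthesis bases and network architectures. The architectures include convolutional long short-term memory networks and time-dilated convolution stacks inspired by time-domain enhancement networks such as ConvTasNet; for the latter, the authors also propose modifications that further improve separation performance. For the short-time Fourier transform (STFT) basis, longer windows (25-50 ms) work best for speech/non-speech separation, while shorter windows (2.5 ms) work best for arbitrary sounds. For learnable bases, shorter windows (2.5 ms) work best on all tasks. For universal sound separation, the STFT outperforms the learnable basis. The best methods improve scale-invariant signal-to-distortion ratio by over 13 dB for speech/non-speech separation and by close to 10 dB for universal sound separation.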
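To make the window-size comparison concrete, the sketch below converts the window lengths mentioned above into samples and into the corresponding STFT frequency-bin spacing. The 16 kHz sample rate, the 50% hop, and the helper name `window_stats` are assumptions chosen for illustration; the text above does not state them.

```python
def window_stats(window_ms, sample_rate_hz=16000, hop_fraction=0.5):
    """Basic time-frequency figures for a framewise analysis window.

    sample_rate_hz and hop_fraction are illustrative assumptions.
    """
    window_samples = round(window_ms * sample_rate_hz / 1000)
    hop_samples = max(1, round(window_samples * hop_fraction))
    return {
        "window_samples": window_samples,
        "hop_samples": hop_samples,
        # Spacing between adjacent STFT bins when the FFT size equals the window.
        "bin_spacing_hz": sample_rate_hz / window_samples,
        # How many frames the network sees per second of audio.
        "frames_per_second": sample_rate_hz / hop_samples,
    }


for ms in (2.5, 25, 50):
    print(ms, window_stats(ms))
```

Under these assumptions a 2.5 ms window gives fine time resolution but coarse frequency bins, while 25-50 ms windows give the reverse, which is the trade-off behind the window-size findings summarized above.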

Universal audio separation