Universal source separation aims to separate the audio sources of an arbitrary mix, removing the constraint to operate on a specific domain like speech or music. Yet, its potential is limited because most existing works focus on mixes dominated by sound events, and small training datasets further constrain supervised learning. Here, we study a single general audio source separation (GASS) model trained to separate speech, music, and sound events in a supervised fashion with a large-scale dataset. We assess GASS models on a diverse set of tasks. Our strong in-distribution results show the feasibility of GASS models, and the competitive out-of-distribution performance in sound event and speech separation shows their generalization abilities. Yet, it remains challenging for GASS models to generalize to separating out-of-distribution cinematic and music content. We also fine-tune GASS models on each dataset, and the fine-tuned models consistently outperform those trained without pre-training. All fine-tuned models (except the music separation one) obtain state-of-the-art results in their respective benchmarks.

GASS: Generalizing Audio Source Separation with Large-Scale Data