We propose DAVIS, a Diffusion model-based Audio-VIsual Separation framework that solves the audio-visual sound source separation task in a generative manner. While existing discriminative methods that perform mask regression have made remarkable progress in this field, they are limited in capturing the complex data distribution required for high-quality separation of sounds from diverse categories. In contrast, DAVIS leverages a generative diffusion model and a Separation U-Net to synthesize separated magnitudes starting from Gaussian noise, conditioned on both the audio mixture and the visual footage. With its generative objective, DAVIS is better suited to high-quality sound separation across diverse categories. We compare DAVIS to existing state-of-the-art discriminative audio-visual separation methods on the domain-specific MUSIC dataset and the open-domain AVE dataset, and the results show that DAVIS outperforms the other methods in separation quality, demonstrating the advantages of our framework for the audio-visual source separation task.
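
The generative procedure described above (start from Gaussian noise and iteratively denoise toward a separated magnitude, conditioned on the mixture and the visual input) can be illustrated with a minimal sketch. It assumes standard DDPM-style ancestral sampling with a linear noise schedule, which the abstract does not specify. The names `make_schedule`, `toy_noise_predictor`, and `sample_separated_magnitude` are hypothetical, and the toy predictor is only a stand-in for the Separation U-Net, not the authors' network.

```python
import numpy as np


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumption; the abstract gives no schedule)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)  # cumulative product of alphas up to step t
    return betas, alphas, alpha_bars


def toy_noise_predictor(x_t, t, mixture_mag, visual_feat):
    """Hypothetical placeholder for the Separation U-Net.

    A real model would be a trained network that predicts the noise in x_t
    given the mixture magnitude and visual features. This toy version only
    shows where the two conditions enter the computation.
    """
    target = 0.5 * mixture_mag * visual_feat.mean()
    return x_t - target


def sample_separated_magnitude(mixture_mag, visual_feat, noise_predictor,
                               num_steps=50, seed=0):
    """DDPM-style reverse process: Gaussian noise -> separated magnitude."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)

    # Start from pure Gaussian noise with the same shape as the mixture.
    x = rng.standard_normal(mixture_mag.shape)

    for t in range(num_steps - 1, -1, -1):
        eps_hat = noise_predictor(x, t, mixture_mag, visual_feat)
        # Posterior mean of the reverse step given the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            # Add fresh noise on every step except the last one.
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
        else:
            x = mean
    return x


if __name__ == "__main__":
    mixture = np.abs(np.random.default_rng(1).standard_normal((4, 8)))  # toy magnitude spectrogram
    visual = np.ones(16)                                                # toy visual feature vector
    separated = sample_separated_magnitude(mixture, visual, toy_noise_predictor)
    print(separated.shape)
```

In this sketch the conditioning is passed to the predictor at every reverse step, so replacing `toy_noise_predictor` with a trained network would leave the sampling loop unchanged.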

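We propose DAVIS, a diffusion model-based audio-visual separation framework that solves the audio-visual sound source separation task in a generative manner. In contrast to existing discriminative methods, DAVIS uses a generative diffusion model and a Separation U-Net to synthesize separated magnitudes starting from Gaussian noise, with the goal of high-quality sound separation across diverse categories. We compare DAVIS with existing state-of-the-art discriminative audio-visual separation methods on the domain-specific MUSIC dataset and the open-domain AVE dataset, and the results show that DAVIS outperforms the other methods in separation quality, demonstrating the advantages of our framework for the audio-visual source separation task.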

DAVIS: High-Quality Audio-Visual Separation with Generative Diffusion Models