We propose DAVIS, a Diffusion model-based Audio-VIsual Separation framework
that solves the audio-visual sound source separation task in a generative
manner. While existing discriminative methods that perform mask regression have
made remarkable progress in this field, they face limitations in capturing the
complex data distribution required for high-quality separation of sounds from
diverse categories. In contrast, DAVIS leverages a generative diffusion model
and a Separation U-Net to synthesize separated magnitudes starting from
Gaussian noise, conditioned on both the audio mixture and the visual footage.
With its generative objective, DAVIS is better suited to achieving the goal of
high-quality sound separation across diverse categories. We compare DAVIS to
existing state-of-the-art discriminative audio-visual separation methods on the
domain-specific MUSIC dataset and the open-domain AVE dataset, and results show
that DAVIS outperforms other methods in separation quality, demonstrating the
advantages of our framework for tackling the audio-visual source separation
task.
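
The generative procedure described above (starting from Gaussian noise and
iteratively denoising toward a separated magnitude, conditioned on the mixture
and the visual input) can be illustrated with a minimal sketch. This is not the
DAVIS implementation: the linear noise schedule, the standard DDPM ancestral
sampling rule, and the function `toy_noise_predictor` (a hypothetical stand-in
for the trained Separation U-Net) are all assumptions made for illustration.

```python
import numpy as np

def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear DDPM noise schedule (a common default, assumed here)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def toy_noise_predictor(x_t, t, mixture_mag, visual_feat):
    """Hypothetical stand-in for the Separation U-Net.

    A real model would be a trained network that also uses the timestep t;
    this placeholder only shows which inputs the denoiser is conditioned on.
    """
    gate = np.tanh(visual_feat.mean())
    return x_t - gate * mixture_mag

def sample_separated_magnitude(mixture_mag, visual_feat, num_steps=50, seed=0):
    """DDPM ancestral sampling conditioned on mixture and visual features."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = rng.standard_normal(mixture_mag.shape)  # start from Gaussian noise
    for t in range(num_steps - 1, -1, -1):
        eps_hat = toy_noise_predictor(x, t, mixture_mag, visual_feat)
        # Mean of x_{t-1} given x_t and the predicted noise.
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            # Add fresh noise at every step except the last one.
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
        else:
            x = mean
    return x
```

Calling `sample_separated_magnitude(mixture_mag, visual_feat)` with a
magnitude spectrogram array and a visual feature vector returns an array of the
same shape as the mixture; with the toy predictor the values carry no meaning
and only the control flow is of interest.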

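We propose DAVIS, a diffusion model-based audio-visual separation framework that solves the audio-visual sound source separation task in a generative manner. In contrast to existing discriminative methods, DAVIS uses a generative diffusion model and a Separation U-Net to synthesize separated magnitudes starting from Gaussian noise, with the goal of high-quality sound separation across diverse categories. We compare DAVIS with existing state-of-the-art discriminative audio-visual separation methods on the domain-specific MUSIC dataset and the open-domain AVE dataset; the results show that DAVIS outperforms the other methods in separation quality, demonstrating the advantages of our framework for the audio-visual source separation task.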

DAVIS: High-Quality Audio-Visual Separation with Generative Diffusion Models

Can we perform end-to-end music source separation with a variable number
of sources using a deep learning model? We present an extension of the
Wave-U-Net model which allows end-to-end monaural source separation with a
non-fixed number of sources. Furthermore, we propose multiplicative
conditioning with instrument labels at the bottleneck of the Wave-U-Net and
show its effect on the separation results. This approach opens the way to other
types of conditioning, such as audio-visual source separation and
score-informed source separation.
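
The idea of multiplicative conditioning at the bottleneck can be sketched as
follows. This is an illustrative assumption rather than the authors' code: the
abstract does not specify how the instrument labels are mapped to the
bottleneck, so the linear projection (`weight`, `bias`) and the per-channel
gating used here are hypothetical choices.

```python
import numpy as np

def multiplicative_conditioning(bottleneck, labels, weight, bias):
    """Scale bottleneck features by a gate computed from instrument labels.

    bottleneck: (channels, time) features at the U-Net bottleneck
    labels:     (num_instruments,) multi-hot vector of instruments present
    weight:     (channels, num_instruments) hypothetical learned projection
    bias:       (channels,) hypothetical learned bias
    """
    gate = weight @ labels + bias        # one scale factor per channel
    return bottleneck * gate[:, None]    # broadcast over the time axis
```

Because the labels enter as a multi-hot vector, the same network can be
queried with different subsets of instruments, which is one way to accommodate
a non-fixed number of sources.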

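This work proposes an extension of the Wave-U-Net model that performs end-to-end music source separation with a variable number of sources. It applies multiplicative conditioning with instrument labels at the bottleneck and shows the effect of this conditioning on the separation results. The approach opens the way to other types of conditioning, such as audio-visual source separation and score-informed source separation.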