We propose DAVIS, a Diffusion model-based Audio-VIsual Separation framework
that solves the audio-visual sound source separation task in a generative
manner. While existing discriminative methods that perform mask regression have
made remarkable progress in this field, they face limitations in capturing the
complex data distribution required for high-quality separation of sounds from
diverse categories. In contrast, DAVIS leverages a generative diffusion model
and a Separation U-Net to synthesize separated magnitudes starting from
Gaussian noise, conditioned on both the audio mixture and the visual footage.
With its generative objective, DAVIS is better suited to achieving the goal of
high-quality sound separation across diverse categories. We compare DAVIS to
existing state-of-the-art discriminative audio-visual separation methods on the
domain-specific MUSIC dataset and the open-domain AVE dataset, and the results show
that DAVIS outperforms other methods in separation quality, demonstrating the
advantages of our framework for tackling the audio-visual source separation
task.
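
The sampling procedure described above (start from Gaussian noise, then denoise step by step while conditioning on the audio mixture and the visual input) can be sketched roughly as follows. This is a minimal illustration of standard DDPM-style ancestral sampling, not the authors' implementation: the linear noise schedule, the step count, the function names, and `toy_denoiser` (an untrained stand-in for the Separation U-Net) are all assumptions made for the example.

```python
import numpy as np

def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    # Assumed linear beta schedule; the abstract does not specify one.
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def toy_denoiser(x_t, t, mixture, visual):
    # Hypothetical placeholder for the Separation U-Net. It returns a
    # "predicted noise" array of the same shape as x_t and is NOT trained.
    return x_t - visual.mean() * mixture

def sample_separated_magnitude(denoiser, mixture, visual, num_steps=50, seed=0):
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    # Start from pure Gaussian noise with the shape of the mixture magnitude.
    x = rng.standard_normal(mixture.shape)
    for t in range(num_steps - 1, -1, -1):
        # Noise prediction conditioned on the mixture and the visual feature.
        eps = denoiser(x, t, mixture, visual)
        # Posterior mean of the standard DDPM reverse step.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            # Add fresh noise on every step except the last one.
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
        else:
            x = mean
    return x

# Usage with random placeholder inputs (not real audio or video features).
mixture = np.abs(np.random.default_rng(1).standard_normal((4, 8)))
visual = np.ones(16)
separated = sample_separated_magnitude(toy_denoiser, mixture, visual, num_steps=10)
```

In a real system the denoiser would be a trained network and the inputs would be a mixture spectrogram and visual features extracted from the video; the loop structure is the only part this sketch is meant to convey.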

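We propose DAVIS, a diffusion model-based audio-visual separation framework that solves the audio-visual sound source separation task in a generative manner. In contrast to existing discriminative methods, DAVIS uses a generative diffusion model and a Separation U-Net to synthesize separated magnitudes starting from Gaussian noise, with the goal of high-quality sound separation across diverse categories. We compare DAVIS with existing state-of-the-art discriminative audio-visual separation methods on the domain-specific MUSIC dataset and the open-domain AVE dataset, and the results show that DAVIS outperforms other methods in separation quality, demonstrating the advantages of our framework for the audio-visual source separation task.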

DAVIS: High-Quality Audio-Visual Separation with Generative Diffusion Models

In recent years, the task of segmenting foreground objects from the background in
a video, i.e., video object segmentation (VOS), has received considerable
attention. In this paper, we propose a single end-to-end trainable deep neural
network, convolutional gated recurrent Mask-RCNN, for tackling the
semi-supervised VOS task. We take advantage of both the instance segmentation
network (Mask-RCNN) and the visual memory module (Conv-GRU) to tackle the VOS
task. The instance segmentation network predicts masks for instances, while the
visual memory module learns to selectively propagate information for multiple
instances simultaneously, which handles appearance changes, variations in
scale and pose, and occlusions between objects. After offline and online
training under purely instance segmentation losses, our approach is able to
achieve satisfactory results without any post-processing or synthetic video
data augmentation. Experimental results on the DAVIS 2016 and DAVIS 2017
datasets demonstrate the effectiveness of our method for the video object
segmentation task.
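
To make the role of the visual memory module more concrete, the sketch below shows one gated recurrent update applied to a feature map, frame by frame. It is an illustration under stated assumptions, not the paper's network: it uses the standard GRU gate equations, replaces spatial convolutions with 1x1 convolutions (plain channel mixing) to stay short, and the names `conv_gru_step`, `init_params`, and the weight keys are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1x1(weight, x):
    # 1x1 convolution: weight is (out_channels, in_channels), x is (in_channels, H, W).
    # A real Conv-GRU would typically use larger spatial kernels (assumption).
    return np.einsum("oc,chw->ohw", weight, x)

def init_params(in_ch, hid_ch, seed=0):
    # Small random weights for the update gate, reset gate, and candidate state.
    rng = np.random.default_rng(seed)
    shape = (hid_ch, in_ch + hid_ch)
    return {name: 0.1 * rng.standard_normal(shape) for name in ("Wz", "Wr", "Wh")}

def conv_gru_step(x, h, params):
    # x: current frame features (in_ch, H, W); h: previous memory (hid_ch, H, W).
    xh = np.concatenate([x, h], axis=0)
    z = sigmoid(conv1x1(params["Wz"], xh))  # update gate: how much to overwrite
    r = sigmoid(conv1x1(params["Wr"], xh))  # reset gate: how much old memory to use
    xrh = np.concatenate([x, r * h], axis=0)
    h_tilde = np.tanh(conv1x1(params["Wh"], xrh))  # candidate memory
    # One common convention; some formulations swap the roles of z and (1 - z).
    return (1.0 - z) * h + z * h_tilde

# Usage: propagate memory across a short sequence of placeholder frame features.
in_ch, hid_ch, height, width = 3, 4, 5, 6
params = init_params(in_ch, hid_ch)
frames = np.random.default_rng(1).standard_normal((7, in_ch, height, width))
h = np.zeros((hid_ch, height, width))
for frame in frames:
    h = conv_gru_step(frame, h, params)
```

The gates let the memory keep or overwrite information at each spatial location, which is the mechanism the abstract credits with coping with appearance change and occlusion.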

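This study proposes an end-to-end deep neural network that combines the Mask-RCNN instance segmentation network with the Conv-GRU visual memory module to address the semi-supervised video object segmentation task. Experimental results show that the method achieves satisfactory results on the DAVIS datasets.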