Sounds provide rich semantics, complementary to visual data, for many tasks. However, in practice, sounds from multiple sources are often mixed together. In this paper we propose a novel framework, referred to as MinusPlus Network (MP-Net), for the task of visual sound separation. MP-Net separates sounds recursively in descending order of average energy, removing each separated sound from the mixture at the end of each prediction, until the mixture becomes empty or contains only noise. In this way, MP-Net can be applied to sound mixtures with arbitrary numbers and types of sounds. Moreover, as MP-Net keeps removing sounds with large energy from the mixture, sounds with small energy can emerge and become clearer, so that the separation is more accurate. Compared to previous methods, MP-Net obtains state-of-the-art results on two large-scale datasets, across mixtures with different types and numbers of sounds.
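
The recursive procedure described in the abstract can be illustrated with a minimal sketch. This is not the MP-Net implementation: the magnitude-spectrogram representation, the names `recursive_separate`, `predict_next`, `make_oracle`, `energy_floor`, and `max_steps`, and the clipping of the residual at zero are all assumptions made for illustration. The learned, visually guided separator is replaced here by a toy oracle that simply returns the known sources in descending order of average energy.

```python
import numpy as np


def average_energy(spec):
    """Mean squared magnitude of a (non-negative) spectrogram."""
    return float(np.mean(spec ** 2))


def recursive_separate(mixture, predict_next, energy_floor=1e-3, max_steps=10):
    """Peel sounds off `mixture` one at a time.

    `predict_next(residual)` is a hypothetical stand-in for a learned
    separator: it returns an estimate of the dominant remaining sound.
    The loop stops when the residual (or the prediction) is nearly empty.
    """
    residual = mixture.copy()
    separated = []
    for _ in range(max_steps):
        if average_energy(residual) < energy_floor:
            break  # mixture is empty or contains only low-energy noise
        sound = predict_next(residual)
        if average_energy(sound) < energy_floor:
            break  # nothing meaningful left to extract
        # Remove the separated sound from the mixture (assumed: clip at zero).
        residual = np.maximum(residual - sound, 0.0)
        separated.append(sound)
    return separated, residual


def make_oracle(sources):
    """Toy separator: hands back known sources, highest average energy first."""
    remaining = sorted(sources, key=average_energy, reverse=True)

    def predict_next(residual):
        return remaining.pop(0) if remaining else np.zeros_like(residual)

    return predict_next


# Usage: three synthetic "sounds" with clearly different energies.
rng = np.random.default_rng(0)
sources = [scale * rng.random((4, 8)) for scale in (0.5, 2.0, 1.0)]
mixture = sum(sources)
separated, residual = recursive_separate(mixture, make_oracle(sources))
```

Because the loop is driven by the residual rather than by a fixed source count, the same code handles mixtures with any number of sounds, which is the property the abstract emphasizes.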

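This paper proposes a novel framework called MinusPlus Network (MP-Net) for the task of visual sound separation. MP-Net recursively separates sounds in order of average energy and removes each separated sound from the mixture, until the mixture is empty or contains only noise. In this way, MP-Net can be applied to mixtures with arbitrary numbers and types of sounds, and it achieves state-of-the-art results compared with previous methods.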

Recursive Visual Sound Separation Using Minus-Plus Net
