Learning how objects sound from video is challenging, since they often heavily overlap in a single audio channel. Current methods for visually-guided audio source separation sidestep the issue by training with artificially mixed video clips, but this puts unwieldy restrictions on training data collection and may even prevent learning the properties of "true" mixed sounds. We introduce a co-separation training paradigm that permits learning object-level sounds from unlabeled multi-source videos. Our novel training objective requires that the deep neural network's separated audio for similar-looking objects be consistently identifiable, while simultaneously reproducing accurate video-level audio tracks for each source training pair. Our approach disentangles sounds in realistic test videos, even in cases where an object was not observed individually during training. We obtain state-of-the-art results on visually-guided audio source separation and audio denoising for the MUSIC, AudioSet, and AV-Bench datasets. Our video results: http://vision.cs.utexas.edu/projects/coseparation/
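
The two-part training objective described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the network predicts one spectrogram mask per detected object, that the "video-level" term is an L1 distance between the summed object masks and a per-video target mask, and that "consistently identifiable" is enforced with a softmax cross-entropy on audio-classifier logits. The function names, the toy values, and the `consistency_weight` parameter are all hypothetical.

```python
import math

def l1_distance(a, b):
    """Mean absolute difference between two equal-length flat lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def cross_entropy(logits, label):
    """Softmax cross-entropy for a single example (numerically stabilized)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]

def co_separation_loss(object_masks, video_of_object, target_masks):
    """Sum the per-object masks within each video and compare the sum
    with that video's target mask (assumed L1 distance)."""
    total = 0.0
    for vid, target in target_masks.items():
        summed = [0.0] * len(target)
        for mask, v in zip(object_masks, video_of_object):
            if v == vid:
                summed = [s + m for s, m in zip(summed, mask)]
        total += l1_distance(summed, target)
    return total

def consistency_loss(audio_logits, object_labels):
    """Average cross-entropy between each object's separated-audio logits
    and the category of the visual object that guided the separation."""
    losses = [cross_entropy(l, y) for l, y in zip(audio_logits, object_labels)]
    return sum(losses) / len(losses)

def total_loss(object_masks, video_of_object, target_masks,
               audio_logits, object_labels, consistency_weight):
    """Weighted sum of the two terms; the weight is a hypothetical hyperparameter."""
    return (co_separation_loss(object_masks, video_of_object, target_masks)
            + consistency_weight * consistency_loss(audio_logits, object_labels))

# Toy training pair: video 0 contains two objects, video 1 contains one.
# Each mask covers four time-frequency bins of the mixed audio.
object_masks = [[0.5, 0.0, 0.25, 0.0],   # object 0 (video 0)
                [0.25, 0.0, 0.25, 0.0],  # object 1 (video 0)
                [0.0, 1.0, 0.5, 1.0]]    # object 2 (video 1)
video_of_object = [0, 0, 1]
target_masks = {0: [0.75, 0.0, 0.5, 0.0], 1: [0.0, 1.0, 0.5, 1.0]}
audio_logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
object_labels = [0, 1, 2]

loss = total_loss(object_masks, video_of_object, target_masks,
                  audio_logits, object_labels, consistency_weight=0.5)
```

In this toy pair the masks of the two objects in video 0 sum exactly to that video's target, so the video-level term is zero and only the consistency term contributes. Note that neither object in video 0 needs to be heard in isolation: supervision comes only from the per-video targets and the object categories.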

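This paper proposes a co-separation training paradigm that learns object-level sounds from unlabeled multi-source videos. Its novel training objective requires the deep neural network's separated audio for similar-looking objects to be consistently identifiable, which yields state-of-the-art results on audio source separation and denoising.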

Separating the Sounds of Visual Objects