The objective of this work is to localize sound sources that are visible in a
video without using manual annotations. Our key technical contribution is to
show that, by training the network to explicitly discriminate challenging image
fragments, even for images that do contain the object emitting the sound, we
can significantly boost the localization performance. We do so by introducing
a mechanism that automatically mines hard samples and adds them to a
contrastive learning formulation. We show that our algorithm achieves
state-of-the-art performance on the popular Flickr SoundNet dataset.
Furthermore, we introduce the VGG-Sound Source (VGG-SS) benchmark, a new set of
annotations for the recently introduced VGG-Sound dataset, where the sound
sources visible in each video clip are explicitly marked with bounding box
annotations. This dataset is 20 times larger than analogous existing ones,
contains 5K videos spanning over 200 categories, and, unlike Flickr
SoundNet, is video-based. On VGG-SS, we also show that our algorithm achieves
state-of-the-art performance against several baselines.
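
To make the idea of mining hard samples inside a contrastive objective more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes one audio embedding and one visual feature map per clip, treats image regions that respond strongly to the paired audio as positives, and treats weakly responding regions of the same image as hard negatives, alongside the usual negatives drawn from other images. The function names, the hard thresholds `pos_thresh` and `neg_thresh`, and the temperature `tau` are all illustrative assumptions.

```python
import numpy as np


def similarity_maps(audio, visual):
    """Cosine similarity between every audio embedding and every image location.

    audio:  (B, C) array, one embedding per clip (assumed layout)
    visual: (B, C, H, W) array, one feature map per frame (assumed layout)
    returns (B, B, H, W); entry [i, j] compares audio i with image j
    """
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    return np.einsum("ic,jchw->ijhw", a, v)


def hard_negative_contrastive_loss(audio, visual,
                                   pos_thresh=0.6, neg_thresh=0.4, tau=0.07):
    """InfoNCE-style loss with hard negatives mined from the paired image.

    Thresholds and temperature are hypothetical values chosen for illustration.
    """
    sim = similarity_maps(audio, visual)
    batch = sim.shape[0]
    idx = np.arange(batch)
    own = sim[idx, idx]                      # (B, H, W): audio i vs. its own image

    pos_mask = own >= pos_thresh             # regions likely to contain the source
    neg_mask = own <= neg_thresh             # hard negatives: same image, other regions
    eps = 1e-8
    pos = (own * pos_mask).sum(axis=(1, 2)) / (pos_mask.sum(axis=(1, 2)) + eps)
    neg_count = neg_mask.sum(axis=(1, 2))
    hard_neg = (own * neg_mask).sum(axis=(1, 2)) / (neg_count + eps)
    # Drop the hard-negative term for images where no region was mined.
    hard_neg = np.where(neg_count > 0, hard_neg, -np.inf)

    # Ordinary negatives: mean response of audio i to every other image j.
    cross = sim.mean(axis=(2, 3))            # (B, B)
    cross_neg = np.where(~np.eye(batch, dtype=bool), cross, -np.inf)

    # Column 0 is the positive; every other column is a negative.
    logits = np.concatenate([pos[:, None], hard_neg[:, None], cross_neg], axis=1) / tau
    m = logits.max(axis=1, keepdims=True)
    log_sum_exp = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
    return float((log_sum_exp - logits[:, 0]).mean())
```

In this sketch the hard masks are not differentiable, so a trainable version would need a smooth substitute for the thresholding step; that choice is left open here.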

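Summary: This paper describes how to train a neural network to localize sound sources that are visible in a video, mining hard image samples within a contrastive learning formulation to improve localization accuracy. The authors also introduce a new benchmark, VGG-Sound Source (VGG-SS), and show that the algorithm achieves state-of-the-art performance on it.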