Large-scale pre-trained image-text models demonstrate remarkable versatility
across diverse tasks, benefiting from their robust representational
capabilities and effective multimodal alignment. We extend the application of
these models, specifically CLIP, to the domain of sound source localization.
Unlike conventional approaches, we employ the pre-trained CLIP model without
explicit text input, relying solely on the audio-visual correspondence. To this
end, we introduce a framework that translates audio signals into tokens
compatible with CLIP's text encoder, yielding audio-driven embeddings. By
directly using these embeddings, our method generates audio-grounded masks for
the provided audio, extracts audio-grounded image features from the highlighted
regions, and aligns them with the audio-driven embeddings using the
audio-visual correspondence objective. Our findings suggest that utilizing
pre-trained image-text models enables our model to generate more complete and
compact localization maps for the sounding objects. Extensive experiments show
that our method outperforms state-of-the-art approaches by a significant
margin.
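
The pipeline described above (audio-driven embedding, audio-grounded mask, masked image feature, audio-visual correspondence objective) can be illustrated with a minimal NumPy sketch. Everything in it is an assumption made for illustration and is not taken from the method itself: the mask is a sigmoid over patch-to-audio cosine similarity standing in for whatever mask generator is actually used, the objective is written as a symmetric InfoNCE loss, the temperature of 0.07 is an arbitrary choice, and all function names are hypothetical. The audio-to-token translation and the CLIP encoders are not modeled; the inputs are assumed to be their outputs.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def audio_grounded_mask(patch_feats, audio_emb, temperature=0.07):
    """Soft mask over image patches (assumed form, not the paper's generator).

    patch_feats: (N, D) patch features from an image encoder.
    audio_emb:   (D,) audio-driven embedding from the text encoder.
    Returns an (N,) array with values in (0, 1).
    """
    sim = l2_normalize(patch_feats) @ l2_normalize(audio_emb)
    return 1.0 / (1.0 + np.exp(-sim / temperature))


def masked_image_feature(patch_feats, mask, eps=1e-8):
    """Mask-weighted average of patch features, then L2-normalized."""
    pooled = (mask[:, None] * patch_feats).sum(axis=0) / (mask.sum() + eps)
    return l2_normalize(pooled)


def avc_loss(image_feats, audio_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch (assumed form of the objective).

    image_feats, audio_embs: (B, D); row i of each comes from the same clip.
    """
    logits = l2_normalize(image_feats) @ l2_normalize(audio_embs).T / temperature

    def cross_entropy_diag(l):
        # Numerically stable log-softmax; the positives sit on the diagonal.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
```

In this sketch, patches that resemble the audio-driven embedding receive mask values near 1, so the pooled image feature is dominated by the sounding region, and the loss is low only when each pooled feature matches the embedding of its own audio clip rather than that of another clip in the batch.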

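This work extends large-scale pre-trained image-text models to sound source localization. Relying on the correspondence between audio signals and images, the method generates audio-driven embeddings, uses them to produce audio-grounded masks for the provided audio, extracts audio-grounded image features from the highlighted regions, and aligns those features with the audio-driven embeddings, yielding more complete and compact localization maps for sounding objects. Extensive experiments show that the method outperforms state-of-the-art approaches.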