Large-scale pre-trained image-text models demonstrate remarkable versatility
across diverse tasks, benefiting from their robust representational
capabilities and effective multimodal alignment. We extend the application of
these models, specifically CLIP, to the domain of sound source localization.
Unlike conventional approaches, we employ the pre-trained CLIP model without
explicit text input, relying solely on the audio-visual correspondence. To this
end, we introduce a framework that translates audio signals into tokens
compatible with CLIP's text encoder, yielding audio-driven embeddings. By
directly using these embeddings, our method generates audio-grounded masks for
the provided audio, extracts audio-grounded image features from the highlighted
regions, and aligns them with the audio-driven embeddings using the
audio-visual correspondence objective. Our findings suggest that utilizing
pre-trained image-text models enables our model to generate more complete and
compact localization maps for the sounding objects. Extensive experiments show
that our method outperforms state-of-the-art approaches by a significant
margin.
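
The pipeline described above (audio-driven embedding, audio-grounded mask, masked image feature, correspondence objective) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the abstract does not give the exact mask function or loss, so the sketch uses a sigmoid over scaled cosine similarity for the mask and an InfoNCE-style contrastive loss for the correspondence objective. The function names and the `temperature` parameter are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-8)

def audio_grounded_mask(patches, audio_emb, temperature=0.1):
    """Soft mask over image patches (assumed form: sigmoid of scaled cosine)."""
    return [1.0 / (1.0 + math.exp(-cosine(p, audio_emb) / temperature))
            for p in patches]

def masked_pool(patches, mask):
    """Mask-weighted average of patch features: the audio-grounded image feature."""
    total = sum(mask) + 1e-8
    dim = len(patches[0])
    return [sum(m * p[i] for m, p in zip(mask, patches)) / total
            for i in range(dim)]

def correspondence_loss(image_feats, audio_embs, temperature=0.1):
    """InfoNCE-style loss (an assumption): pair i of image and audio is the positive."""
    loss = 0.0
    for i, a in enumerate(audio_embs):
        logits = [cosine(f, a) / temperature for f in image_feats]
        peak = max(logits)  # subtract the max for numerical stability
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_denom - logits[i]
    return loss / len(audio_embs)

# Toy usage: three image patches and one audio-driven embedding.
patches = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
audio_emb = [1.0, 0.0]
mask = audio_grounded_mask(patches, audio_emb)
pooled = masked_pool(patches, mask)
```

In this toy setup the patch aligned with the audio embedding receives the largest mask value, and the loss is lower when image features are paired with their own audio embeddings than when the pairs are swapped.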

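This work extends large-scale pre-trained image-text models to sound source localization. Relying on the correspondence between audio signals and images, it produces audio-driven embeddings, uses them to generate audio-grounded masks for the provided audio, extracts audio-grounded image features from the highlighted regions, and aligns those features with the audio-driven embeddings, yielding more complete and compact localization maps for sounding objects. Extensive experiments show that the method outperforms state-of-the-art approaches.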

Can CLIP Help Sound Source Localization?

Weakly Supervised Semantic Segmentation (WSSS) based on image-level labels
has attracted much attention due to low annotation costs. Existing methods
often rely on Class Activation Mapping (CAM), which measures the correlation
between image pixels and classifier weights. However, the classifier focuses
only on the discriminative regions while ignoring other useful information in
each image, resulting in incomplete localization maps. To address this issue,
we propose a Self-supervised Image-specific Prototype Exploration (SIPE) that
consists of an Image-specific Prototype Exploration (IPE) and a
General-Specific Consistency (GSC) loss. Specifically, IPE tailors prototypes
for every image to capture complete regions, forming our Image-Specific CAM
(IS-CAM), which is realized in two sequential steps. In addition, GSC is
proposed to enforce consistency between the general CAM and our specific IS-CAM,
which further optimizes the feature representation and gives prototype
exploration a self-correction ability. Extensive experiments are
conducted on the PASCAL VOC 2012 and MS COCO 2014 segmentation benchmarks, and
results show that our SIPE achieves new state-of-the-art performance using only
image-level labels. The code is available at
this https URL
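
To make the contrast between a general CAM and an image-specific map concrete, here is a minimal sketch. It is not the authors' implementation: the abstract does not detail the two steps of IPE or the form of the GSC loss, so this sketch assumes a thresholded CAM as the seed region, a mean-feature prototype, a cosine-similarity map, and a mean absolute difference as the consistency measure. All function names and the `threshold` parameter are hypothetical.

```python
import math

def cam(features, class_weight):
    """General CAM: ReLU of the dot product between each pixel feature and
    the classifier weight, normalized by the maximum response."""
    raw = [max(0.0, sum(w * x for w, x in zip(class_weight, f)))
           for f in features]
    peak = max(raw) or 1.0  # avoid dividing by zero when nothing responds
    return [r / peak for r in raw]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-8)

def image_specific_cam(features, seed_map, threshold=0.5):
    """Assumed two steps: (1) average the features of seed pixels into a
    prototype for this image, (2) score every pixel by ReLU(cosine)."""
    seeds = [f for f, s in zip(features, seed_map) if s >= threshold]
    if not seeds:
        return [0.0] * len(features)
    dim = len(features[0])
    proto = [sum(f[i] for f in seeds) / len(seeds) for i in range(dim)]
    return [max(0.0, cosine(f, proto)) for f in features]

def consistency(map_a, map_b):
    """Assumed consistency measure: mean absolute difference of the two maps."""
    return sum(abs(a - b) for a, b in zip(map_a, map_b)) / len(map_a)

# Toy image with three pixels: a discriminative object part, a less
# discriminative part of the same object, and background.
features = [[2.0, 1.0], [0.4, 1.0], [0.0, -1.0]]
class_weight = [1.0, 0.0]
general = cam(features, class_weight)
specific = image_specific_cam(features, general)
gap = consistency(general, specific)
```

In this toy case the general CAM gives the second pixel a weak response of 0.2, while the prototype-based map scores it far higher because its feature points in nearly the same direction as the seed pixel's feature. The background pixel stays at zero in both maps.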

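This work proposes a weakly supervised semantic segmentation method based on self-supervised image-specific prototype exploration. It combines image-specific class activation mapping with a consistency loss to capture complete regions and improve the feature representation, and achieves new state-of-the-art performance on the PASCAL VOC 2012 and MS COCO 2014 segmentation benchmarks using only image-level labels.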

Self-supervised Image-specific Prototype Exploration for Weakly Supervised Semantic Segmentation

The main obstacle to weakly supervised semantic image segmentation is the
difficulty of obtaining pixel-level information from coarse image-level
annotations. Most methods based on image-level annotations use localization
maps obtained from the classifier, but these only focus on the small
discriminative parts of objects and do not capture precise boundaries.
FickleNet explores diverse combinations of locations on feature maps created by
generic deep neural networks. It selects hidden units randomly and then uses
them to obtain activation scores for image classification. FickleNet implicitly
learns the coherence of each location in the feature maps, resulting in a
localization map which identifies both discriminative and other parts of
objects. The ensemble effects are obtained from a single network by selecting
random hidden unit pairs, which means that a variety of localization maps are
generated from a single image. Our approach does not require any additional
training steps and only adds a simple layer to a standard convolutional neural
network; nevertheless, it outperforms recent comparable techniques on the Pascal
VOC 2012 benchmark in both weakly and semi-supervised settings.
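
The idea of obtaining many localization maps from one image by randomly selecting hidden units can be sketched as follows. This is a toy illustration under stated assumptions, not the FickleNet layer itself: each entry of a per-location feature vector stands in for a hidden unit, every unit is kept independently with probability `keep_prob`, a simple ReLU of a weighted sum stands in for the activation score, and the maps are combined by an element-wise maximum. The function names, `keep_prob`, `passes`, and `seed` are all hypothetical.

```python
import random

def stochastic_map(features, class_weight, keep_prob, rng):
    """One random pass: keep each hidden unit with probability keep_prob,
    zero the rest, then score every location with ReLU(w . kept_units)."""
    scores = []
    for f in features:
        kept = [x if rng.random() < keep_prob else 0.0 for x in f]
        scores.append(max(0.0, sum(w * x for w, x in zip(class_weight, kept))))
    return scores

def aggregate_maps(features, class_weight, passes=20, keep_prob=0.5, seed=0):
    """Run several random passes over the same image and combine the
    resulting maps with an element-wise maximum (assumed aggregation rule)."""
    rng = random.Random(seed)
    maps = [stochastic_map(features, class_weight, keep_prob, rng)
            for _ in range(passes)]
    combined = [max(m[i] for m in maps) for i in range(len(features))]
    return combined, maps

# Toy image with three locations and two hidden units per location.
features = [[2.0, 0.5], [1.0, -1.0], [0.0, -1.0]]
class_weight = [1.0, 1.0]
combined, maps = aggregate_maps(features, class_weight)
```

The second location scores zero when all units are used, because its two units cancel. It can only respond in a pass that happens to drop the negative unit, which is why combining many random passes can cover parts of an object that a single deterministic pass would miss.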

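FickleNet proposes a neural-network-based method for semantic image segmentation that obtains activation scores through random selection of hidden units and learns the coherence of each location in the feature maps, producing localization maps that cover the discriminative parts of objects and also locate their boundaries accurately. The method requires no additional training steps and, by adding only a simple layer to a standard convolutional neural network, improves performance on weakly and semi-supervised segmentation tasks.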