Audio-Visual Source Localization (AVSL) is the task of identifying specific
sounding objects in the scene given audio cues. In our work, we focus on
semi-supervised AVSL with pseudo-labeling. To address the issues with vanilla
hard pseudo-labels including bias accumulation, noise sensitivity, and
instability, we propose a novel method named Cross Pseudo-Labeling (XPL),
wherein two models learn from each other through a cross-refine mechanism to
avoid bias accumulation. We equip XPL with two effective components. First,
soft pseudo-labels with sharpening and a pseudo-label exponential moving
average enable the models to improve gradually and keep training stable.
Second, a curriculum data selection module adaptively selects high-quality
pseudo-labels during training to mitigate potential bias.
Experimental results demonstrate that XPL significantly outperforms
existing methods, achieving state-of-the-art performance while effectively
mitigating confirmation bias and ensuring training stability.
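
As a rough illustration of the soft pseudo-label idea described above, the sketch below sharpens a per-pixel confidence map with a temperature and then smooths it across training steps with an exponential moving average. The function names `sharpen` and `ema_update`, the binary (sounding vs. silent) sharpening formula, and the default values of `temperature` and `momentum` are assumptions made for illustration; the abstract does not specify them.

```python
def sharpen(prob_map, temperature=0.5):
    """Sharpen per-pixel foreground probabilities (binary case).

    Each probability p becomes p^(1/T) / (p^(1/T) + (1 - p)^(1/T)).
    With T < 1 the values move towards 0 or 1 while staying soft.
    """
    inv_t = 1.0 / temperature
    sharpened = []
    for row in prob_map:
        new_row = []
        for p in row:
            fg = p ** inv_t
            bg = (1.0 - p) ** inv_t
            new_row.append(fg / (fg + bg))
        sharpened.append(new_row)
    return sharpened


def ema_update(running, new, momentum=0.9):
    """Blend the stored pseudo-label with the newly predicted one.

    A high momentum keeps the pseudo-label from jumping between steps.
    """
    return [
        [momentum * r + (1.0 - momentum) * n for r, n in zip(r_row, n_row)]
        for r_row, n_row in zip(running, new)
    ]


# Hypothetical 2x2 confidence map predicted by one model for an unlabeled frame.
prediction = [[0.5, 0.8], [0.2, 0.95]]
stored_label = [[0.5, 0.5], [0.5, 0.5]]
stored_label = ema_update(stored_label, sharpen(prediction))
```

In a cross pseudo-labeling setup, the pseudo-label stored for one model would supervise the other model, so neither model is trained only on its own predictions.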

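Our research focuses on pseudo-labeling for semi-supervised AVSL. We propose a new method, Cross Pseudo-Labeling (XPL), which avoids bias accumulation through mutual learning and a cross-refine mechanism, and combines soft pseudo-labels with a curriculum data selection module for stable training. Experiments show that XPL significantly outperforms existing methods and effectively mitigates confirmation bias while remaining stable.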

Cross Pseudo-Labeling for Semi-Supervised Audio-Visual Source Localization

Audio-Visual Source Localization (AVSL) aims to locate sounding objects
within video frames given the paired audio clips. Existing methods
predominantly rely on self-supervised contrastive learning of audio-visual
correspondence. Without any bounding-box annotations, they struggle to achieve
precise localization, especially for small objects, and suffer from blurry
boundaries and false positives. Moreover, naive semi-supervised methods make
poor use of the information in abundant unlabeled data. In this
paper, we propose a novel semi-supervised learning framework for AVSL, namely
Dual Mean-Teacher (DMT), comprising two teacher-student structures to
circumvent the confirmation bias issue. Specifically, two teachers, pre-trained
on limited labeled data, are employed to filter out noisy samples via the
consensus between their predictions, and then generate high-quality
pseudo-labels by intersecting their confidence maps. Full use of both labeled
and unlabeled data, together with the proposed unbiased framework, enables
DMT to outperform current state-of-the-art methods by a large margin, with CIoU
of 90.4% and 48.8% on Flickr-SoundNet and VGG-Sound Source, obtaining 8.9%,
9.6% and 4.6%, 6.4% improvements over self- and semi-supervised methods
respectively, given only 3% positional annotations. We also extend our
framework to some existing AVSL methods and consistently boost their
performance.
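
The consensus filtering and map intersection described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper names, the use of IoU between binarized maps as the agreement measure, and the values of `threshold` and `min_agreement` are all assumptions.

```python
def binarize(conf_map, threshold=0.5):
    """Turn a confidence map into a 0/1 mask."""
    return [[1 if c >= threshold else 0 for c in row] for row in conf_map]


def iou(mask_a, mask_b):
    """Intersection over union of two binary masks (0.0 if both are empty)."""
    inter = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += a & b
            union += a | b
    return inter / union if union else 0.0


def pseudo_label(conf_a, conf_b, threshold=0.5, min_agreement=0.5):
    """Return the intersected mask, or None if the two teachers disagree."""
    mask_a = binarize(conf_a, threshold)
    mask_b = binarize(conf_b, threshold)
    if iou(mask_a, mask_b) < min_agreement:
        return None  # low consensus: treat the sample as noisy and skip it
    # Keep only pixels that both teachers mark as sounding.
    return [[a & b for a, b in zip(ra, rb)] for ra, rb in zip(mask_a, mask_b)]


# Hypothetical 2x2 confidence maps from two teachers for the same frame.
teacher_a = [[0.9, 0.8], [0.1, 0.2]]
teacher_b = [[0.7, 0.3], [0.1, 0.1]]
label = pseudo_label(teacher_a, teacher_b)
```

Intersecting the maps is conservative: a pixel enters the pseudo-label only when both teachers are confident, which trades some coverage for fewer false positives.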

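This work proposes a new semi-supervised learning framework, Dual Mean-Teacher (DMT), which uses two teacher-student structures to circumvent confirmation bias and makes full use of both labeled and unlabeled data. Noisy samples are filtered out via the consensus between the teachers, which then generate high-quality pseudo-labels, yielding performance clearly better than current state-of-the-art methods in Audio-Visual Source Localization (AVSL).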

Dual Mean-Teacher: An Unbiased Semi-Supervised Framework for Audio-Visual Source Localization

Self-supervised audio-visual source localization aims to locate sound-source
objects in video frames without extra annotations. Recent methods often
approach this goal with the help of contrastive learning, which assumes only
the audio and visual contents from the same video are positive samples for each
other. However, this assumption suffers from false negative samples in
real-world training. For example, for an audio sample, treating the frames from
the same audio class as negative samples may mislead the model and therefore
harm the learned representations (e.g., the audio of a siren wailing may
reasonably correspond to the ambulances in multiple images). Based on this
observation, we propose a new learning strategy named False Negative Aware
Contrastive (FNAC) to mitigate the problem of misleading the training with such
false negative samples. Specifically, we utilize the intra-modal similarities
to identify potentially similar samples and construct corresponding adjacency
matrices to guide contrastive learning. Further, we propose to strengthen the
role of true negative samples by explicitly leveraging the visual features of
sound sources to facilitate the differentiation of authentic sounding source
regions. FNAC achieves state-of-the-art performance on Flickr-SoundNet,
VGG-Sound, and AVSBench, which demonstrates the effectiveness of our method in
mitigating the false negative issue. The code is available at
https://github.com/OpenNLPLab/FNAC_AVL.
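
To make the adjacency-matrix idea above more concrete, the sketch below computes intra-modal cosine similarities within a batch and flags off-diagonal pairs that look too similar to be treated as negatives. It is a simplified illustration under stated assumptions: the function names, the hard `threshold`, and the toy feature vectors are hypothetical, and the released FNAC code may use the adjacency matrix differently (for example as a soft target rather than a hard mask).

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def adjacency(features):
    """Intra-modal similarity matrix for one batch of features."""
    return [[cosine(u, v) for v in features] for u in features]


def likely_false_negatives(features, threshold=0.9):
    """Index pairs (i, j), i != j, whose intra-modal similarity is high.

    Standard contrastive learning would treat these pairs as negatives,
    although they probably share the same sound class.
    """
    adj = adjacency(features)
    n = len(features)
    return [
        (i, j)
        for i in range(n)
        for j in range(n)
        if i != j and adj[i][j] >= threshold
    ]


# Hypothetical audio embeddings: two siren-like clips and one unrelated clip.
audio_features = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
pairs = likely_false_negatives(audio_features)
```

The flagged pairs could then be down-weighted or excluded from the negative set of the audio-visual contrastive loss.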

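This study proposes a new learning strategy for self-supervised audio-visual source localization, False Negative Aware Contrastive (FNAC), which aims to mitigate the false negative problem in real-world training. The method uses intra-modal similarities to identify similar samples and builds the corresponding adjacency matrices to guide contrastive learning. It further strengthens the role of true negative samples by explicitly leveraging the visual features of sound sources to distinguish genuine sounding regions, achieving state-of-the-art performance on Flickr-SoundNet, VGG-Sound, and AVSBench.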

Learning Audio-Visual Source Localization via False Negative Aware Contrastive Learning

Audio-visual source localization is a challenging task that aims to predict
the location of visual sound sources in a video. Since collecting ground-truth
annotations of sounding objects can be costly, a plethora of weakly-supervised
localization methods that can learn from datasets with no bounding-box
annotations have been proposed in recent years, by leveraging the natural
co-occurrence of audio and visual signals. Despite significant interest,
popular evaluation protocols have two major flaws. First, they allow for the
use of a fully annotated dataset to perform early stopping, thus significantly
increasing the annotation effort required for training. Second, current
evaluation metrics assume the presence of sound sources at all times. This is
of course an unrealistic assumption, and thus better metrics are necessary to
capture the model's performance on (negative) samples with no visible sound
sources. To accomplish this, we extend the test set of popular benchmarks,
Flickr SoundNet and VGG-Sound Source, to include negative samples,
and measure performance using metrics that balance localization accuracy and
recall. Using the new protocol, we conducted an extensive evaluation of prior
methods, and found that most prior works are not capable of identifying
negatives and suffer from significant overfitting (they rely heavily on
early stopping for best results). We also propose a new approach for visual
sound source localization that addresses both these problems. In particular, we
found that, through extreme visual dropout and the use of momentum encoders,
the proposed approach combats overfitting effectively, and establishes a new
state-of-the-art performance on both Flickr SoundNet and VGG-Sound Source. Code
and pre-trained models are available online.
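
One way to score a test set that contains negative samples is sketched below. The abstract does not define the metrics, so the counting convention here is an assumption: a prediction is a true positive only when the model reports a visible source and the predicted region overlaps the ground truth well enough, a confident prediction on a negative or poorly localized sample is a false positive, and a missed source is a false negative. The function name, tuple layout and thresholds are hypothetical.

```python
def detection_scores(samples, conf_threshold=0.5, iou_threshold=0.5):
    """Precision, recall and F1 over samples with and without sound sources.

    samples: iterable of (confidence, iou, has_source) tuples, where
    confidence is the model's belief that a source is visible and iou is
    the overlap with the ground truth (ignored when has_source is False).
    """
    tp = fp = fn = 0
    for confidence, iou, has_source in samples:
        predicted_present = confidence >= conf_threshold
        if predicted_present:
            if has_source and iou >= iou_threshold:
                tp += 1  # source found and well localized
            else:
                fp += 1  # hallucinated or badly localized source
        elif has_source:
            fn += 1  # visible source missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical test set: three positive samples and two negative samples.
test_samples = [
    (0.9, 0.7, True),
    (0.8, 0.2, True),
    (0.3, 0.0, True),
    (0.7, 0.0, False),
    (0.1, 0.0, False),
]
precision, recall, f1 = detection_scores(test_samples)
```

Under this convention, a model that always predicts a source is penalized on every negative sample, which a localization-only metric would not capture.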

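This paper proposes a new audio-visual source localization method together with a new evaluation protocol: the test sets are extended with negative samples, and extreme visual dropout and momentum encoders are used to address the failure to identify negatives and the overfitting of prior methods.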