Existing machine learning research has achieved promising results in monaural
audio-visual separation (MAVS). However, most MAVS methods consider only what
the sound source is, not where it is located. This is a problem in VR/AR
scenarios, where listeners need to distinguish between similar audio
sources located in different directions. To address this limitation, we
generalize MAVS to spatial audio separation and propose LAVSS: a
location-guided audio-visual spatial audio separator. LAVSS is inspired by the
correlation between spatial audio and visual location. We introduce the phase
difference carried by binaural audio as a spatial cue, and we use positional
representations of sounding objects as additional modality guidance. We also
leverage multi-level cross-modal attention to fuse visual and positional
features with audio features. In addition, we adopt a pre-trained monaural
separator to transfer knowledge from abundant mono sounds to spatial audio
separation, exploiting the correlation between monaural and binaural
channels. Experiments on the FAIR-Play dataset show that the proposed LAVSS
outperforms existing audio-visual separation baselines. Our
project page: this https URL
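
As a rough illustration of the spatial cue mentioned above, the phase difference
between the two channels of a binaural recording can be computed per
time-frequency bin from the short-time Fourier transforms of the left and right
signals. The sketch below is not the authors' implementation: the function
names, the Hann window, and the `n_fft` and `hop` values are assumptions chosen
only for illustration.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Naive STFT with a Hann window (illustrative helper, hypothetical name).

    Returns a complex array of shape (n_frames, n_fft // 2 + 1).
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)

def interaural_phase_difference(left, right, n_fft=512, hop=128):
    """Per-bin phase difference between left and right channels, in radians.

    The angle of L * conj(R) equals phase(L) - phase(R), wrapped to (-pi, pi].
    """
    spec_left = stft(left, n_fft, hop)
    spec_right = stft(right, n_fft, hop)
    return np.angle(spec_left * np.conj(spec_right))

if __name__ == "__main__":
    # Assumed toy input: a 1 kHz tone that reaches the right ear 2 samples late.
    sr = 16000
    t = np.arange(sr) / sr
    left = np.sin(2 * np.pi * 1000 * t)
    right = np.sin(2 * np.pi * 1000 * (t - 2 / sr))
    ipd = interaural_phase_difference(left, right)
    # Bin 32 corresponds to 1 kHz when sr = 16000 and n_fft = 512.
    print(ipd[:, 32].mean())
```

For a pure tone, a delay of `d` samples in one channel shows up as a phase
difference of about `2 * pi * f * d / sr` at the tone's frequency bin, which is
why the phase map carries direction information that magnitude alone does not.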

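Existing machine learning research has achieved promising results in monaural
audio-visual separation. However, most audio-visual separation methods consider
only what the sound source is, not where it is located. This can be a problem
in VR/AR scenarios, because users need to distinguish between similar audio
sources in different directions. To address this limitation, we generalize
audio-visual separation to spatial audio separation and propose a
location-guided audio-visual spatial audio separator (LAVSS). LAVSS is inspired
by the correlation between spatial audio and visual location. We introduce the
phase difference contained in binaural audio as a spatial cue and use
positional representations of sounding objects as additional modality guidance.
We also adopt multi-level cross-modal attention for visual-positional
collaboration, and we use a pre-trained monaural separator to transfer
knowledge from abundant mono audio to improve spatial audio separation.
Experiments on the FAIR-Play dataset demonstrate the superiority of the
proposed LAVSS in audio-visual separation.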

LAVSS: Location-Guided Audio-Visual Spatial Audio Separation

We propose a self-supervised approach for learning to perform audio source
separation in videos based on natural language queries, using only unlabeled
video and audio pairs as training data. A key challenge in this task is
learning to associate the linguistic description of a sound-emitting object with
its visual features and the corresponding components of the audio waveform, all
without access to annotations during training. To overcome this challenge, we
adapt off-the-shelf vision-language foundation models to provide pseudo-target
supervision via two novel loss functions and encourage a stronger alignment
between the audio, visual and natural language modalities. During inference,
our approach can separate sounds given text, video and audio input, or given
text and audio input alone. We demonstrate the effectiveness of our
self-supervised approach on three audio-visual separation datasets, including
MUSIC, SOLOS and AudioSet, where we outperform state-of-the-art strongly
supervised approaches despite not using object detectors or text labels during
training.

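This work uses self-supervised learning to separate audio sources in response
to natural language queries, training only on unlabeled video and audio pairs.
The model learns to associate the linguistic description of a sound-emitting
object with its visual features and the corresponding components of the audio
waveform. Vision-language foundation models and two new loss functions provide
pseudo-target supervision, and the approach can separate sounds without using
object detectors or text labels during training.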