Existing machine learning research has achieved promising results in monaural audio-visual separation (MAVS). However, most MAVS methods purely consider what the sound source is, not where it is located. This can be a problem in VR/AR scenarios, where listeners need to be able to distinguish between similar audio sources located in different directions. To address this limitation, we have generalized MAVS to spatial audio separation and proposed LAVSS: a location-guided audio-visual spatial audio separator. LAVSS is inspired by the correlation between spatial audio and visual location. We introduce the phase difference carried by binaural audio as spatial cues, and we utilize positional representations of sounding objects as additional modality guidance. We also leverage multi-level cross-modal attention to perform visual-positional collaboration with audio features. In addition, we adopt a pre-trained monaural separator to transfer knowledge from rich mono sounds to boost spatial audio separation. This exploits the correlation between monaural and binaural channels. Experiments on the FAIR-Play dataset demonstrate the superiority of the proposed LAVSS over existing benchmarks of audio-visual separation. Our project page: https://yyx666660.github.io/LAVSS/.
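The spatial cue named above, the phase difference between the two binaural channels, can be illustrated with a minimal NumPy sketch. This is an illustration under stated assumptions, not the LAVSS implementation: the function names `stft` and `interaural_phase_difference`, the STFT parameters, and the cosine/sine encoding of the phase difference are all hypothetical choices made for this example, since the abstract does not specify how the cue is encoded.

```python
import numpy as np


def stft(x, n_fft=512, hop=160):
    """Complex STFT of a 1-D signal, shape (frames, n_fft // 2 + 1).

    Hypothetical helper: a Hann-windowed, non-padded STFT kept deliberately simple.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=-1)


def interaural_phase_difference(left, right, n_fft=512, hop=160):
    """Return (cos, sin) of the per-bin phase difference between two channels.

    Assumption: encoding the phase difference as cos/sin avoids the
    discontinuity at +/- pi; the paper may encode the cue differently.
    """
    spec_left = stft(left, n_fft, hop)
    spec_right = stft(right, n_fft, hop)
    # The angle of L * conj(R) is phase(L) - phase(R), wrapped to (-pi, pi].
    ipd = np.angle(spec_left * np.conj(spec_right))
    return np.cos(ipd), np.sin(ipd)


# Toy binaural signal: a 500 Hz tone whose right channel lags by pi/4.
sr = 16000
t = np.arange(sr) / sr
left = np.sin(2 * np.pi * 500 * t)
right = np.sin(2 * np.pi * 500 * t - np.pi / 4)
cos_ipd, sin_ipd = interaural_phase_difference(left, right)
```

In this toy case the 500 Hz tone falls exactly on frequency bin 16 (500 Hz × 512 / 16000), so the recovered phase difference at that bin is the π/4 lag in every frame. Bins that carry almost no energy yield arbitrary phase values and would normally be masked or down-weighted by magnitude.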

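Existing machine learning research has achieved promising results in monaural audio-visual separation. However, most audio-visual separation methods consider only what the sound source is, not where it is located. This can be a problem in virtual reality/augmented reality scenarios, because users need to be able to distinguish similar audio sources located in different directions. To address this limitation, we generalize audio-visual separation to spatial audio separation and propose a location-guided audio-visual spatial audio separator (LAVSS). LAVSS is inspired by the correlation between spatial audio and visual location. We introduce the phase difference contained in binaural audio as a spatial cue and use positional representations of sounding objects as additional modality guidance. We also adopt multi-level cross-modal attention for visual-positional collaboration, and use a pre-trained monaural separator to transfer knowledge from rich mono audio to improve spatial audio separation. Experiments on the FAIR-Play dataset demonstrate the superiority of the proposed LAVSS in audio-visual separation.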

Location-Guided Audio-Visual Spatial Audio Separation