Audio-visual segmentation (AVS) aims to segment sound sources in a video sequence, which requires a pixel-level understanding of audio-visual correspondence. As the Segment Anything Model (SAM) has had a strong impact on a wide range of dense prediction problems, prior works have investigated adapting SAM to AVS by using audio as a new prompt modality. Nevertheless, constrained by SAM's single-frame segmentation scheme, the temporal context across multiple frames of audio-visual data remains insufficiently utilized. To this end, we study the extension of SAM's capabilities to sequences of audio-visual scenes by analyzing contextual cross-modal relationships across frames. To achieve this, we propose a Spatio-Temporal, Bidirectional Audio-Visual Attention (ST-BAVA) module inserted between SAM's image encoder and mask decoder. It adaptively updates the audio-visual features to convey the spatio-temporal correspondence between the video frames and audio streams. Extensive experiments demonstrate that our proposed model outperforms state-of-the-art methods on AVS benchmarks, most notably with an 8.3% mIoU gain on a challenging multi-source subset.
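
To make the idea of bidirectional audio-visual attention over space and time more concrete, the following is a minimal NumPy sketch. It is an illustration under simplifying assumptions, not the authors' ST-BAVA implementation: the learned query/key/value projections, multi-head structure, and normalization layers are omitted, and the function names (`cross_attention`, `bidirectional_av_attention`) and tensor shapes are hypothetical choices made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query, context):
    """Scaled dot-product attention of `query` tokens over `context` tokens.

    Assumption: no learned projections; the context serves as both keys and values.
    query: (Nq, C), context: (Nc, C) -> returns (Nq, C)
    """
    channels = query.shape[-1]
    scores = query @ context.T / np.sqrt(channels)  # (Nq, Nc)
    return softmax(scores, axis=-1) @ context       # (Nq, C)


def bidirectional_av_attention(visual, audio):
    """Update visual and audio features with attention in both directions.

    visual: (T, H, W, C) per-frame image features (e.g. from an image encoder)
    audio:  (T, C) one audio feature per frame
    Returns updated features with the same shapes.
    """
    T, H, W, C = visual.shape
    # Flatten space and time so each pixel of each frame is one token.
    v_tokens = visual.reshape(T * H * W, C)
    # Audio -> visual: each pixel token attends to the audio of all T frames.
    v_updated = v_tokens + cross_attention(v_tokens, audio)
    # Visual -> audio: each audio token attends to all spatio-temporal pixel tokens.
    a_updated = audio + cross_attention(audio, v_tokens)
    return v_updated.reshape(T, H, W, C), a_updated


# Toy example: 5 frames, 4x4 feature map, 8 channels.
rng = np.random.default_rng(0)
visual = rng.standard_normal((5, 4, 4, 8))
audio = rng.standard_normal((5, 8))
v_out, a_out = bidirectional_av_attention(visual, audio)
```

In this sketch both directions read from the original features and add the attended result as a residual, so the output shapes match the inputs and the updated features could in principle be passed on to a mask decoder. How the actual module orders or combines the two directions is not specified in the abstract.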

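By analyzing contextual cross-modal relationships across video frames, this work studies how to extend the capabilities of the Segment Anything Model (SAM) to sequences of audio-visual scenes. It proposes a model that incorporates a Spatio-Temporal, Bidirectional Audio-Visual Attention (ST-BAVA) module to achieve a pixel-level understanding of audio-visual correspondence. Experimental results show that the model outperforms other methods on audio-visual segmentation, in particular with an 8.3% mIoU gain on data containing multiple sound sources.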

Extending the Segment Anything Model to the Audio and Temporal Dimensions for Audio-Visual Segmentation