Audio-visual segmentation (AVS) aims to segment the sounding objects in each
frame of a given video. To distinguish the sounding objects from silent ones,
both audio-visual semantic correspondence and temporal interaction are
required. The previous method applies multi-frame cross-modal attention to
conduct pixel-level interactions between audio features and visual features of
multiple frames simultaneously, which is both redundant and implicit. In this
paper, we propose an Audio-Queried Transformer architecture, AQFormer, where we
define a set of object queries conditioned on audio information and associate
each of them with particular sounding objects. Explicit object-level semantic
correspondence between audio and visual modalities is established by gathering
object information from visual features with predefined audio queries. In addition,
an Audio-Bridged Temporal Interaction module is proposed to exchange sounding
object-relevant information among multiple frames, using audio features as a
bridge. Extensive experiments are conducted on two AVS benchmarks to show
that our method achieves state-of-the-art performance, notably gains of 7.1% M_J
and 7.6% M_F on the MS3 setting.
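
To make the idea of audio-conditioned object queries concrete, the following is a minimal sketch, not the AQFormer implementation. It assumes a single attention head with no learned projections, and it conditions each query by simply adding the audio feature to it; the function names (`condition_queries`, `gather_object_features`, `predict_mask_logits`) and this additive conditioning are hypothetical choices made only for illustration. Each query gathers object information from the pixel features of one frame through cross-attention, and the resulting object feature is compared against every pixel to produce one mask per query.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def condition_queries(base_queries, audio_feat):
    """Assumption: condition each object query by adding the audio feature."""
    return [[q + a for q, a in zip(query, audio_feat)] for query in base_queries]


def gather_object_features(queries, pixel_feats):
    """Cross-attention: each query takes a softmax-weighted average of pixel features."""
    dim = len(queries[0])
    gathered = []
    for query in queries:
        scores = [
            sum(q * p for q, p in zip(query, pixel)) / math.sqrt(dim)
            for pixel in pixel_feats
        ]
        weights = softmax(scores)
        gathered.append(
            [sum(w * pixel[k] for w, pixel in zip(weights, pixel_feats))
             for k in range(dim)]
        )
    return gathered


def predict_mask_logits(object_feats, pixel_feats):
    """One mask per query: dot product of the object feature with every pixel."""
    return [
        [sum(o * p for o, p in zip(obj, pixel)) for pixel in pixel_feats]
        for obj in object_feats
    ]


# Toy frame: 6 "pixels" with 4-dimensional features, and 3 object queries.
random.seed(0)
dim, num_queries, num_pixels = 4, 3, 6
base_queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_queries)]
audio_feat = [random.gauss(0, 1) for _ in range(dim)]
pixel_feats = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_pixels)]

queries = condition_queries(base_queries, audio_feat)
object_feats = gather_object_features(queries, pixel_feats)
masks = predict_mask_logits(object_feats, pixel_feats)
print(len(masks), len(masks[0]))  # number of queries, number of pixels
```

Because the correspondence is computed per query rather than per pixel pair across frames, the interaction between audio and vision in this sketch is explicit at the object level, which is the property the abstract emphasizes.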

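We propose an audio-queried Transformer architecture (AQFormer) that establishes explicit object-level semantic correspondence between the audio and visual modalities by gathering object information from visual features with predefined audio queries. We also propose an audio-based temporal interaction module that exchanges sounding-object-relevant information among multiple frames. Experimental results show that our method achieves state-of-the-art performance on two AVS benchmarks, notably gains of 7.1% in M_J and 7.6% in M_F on the MS3 setting.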