Audio-visual question answering (AVQA) is a challenging task that requires
multistep spatio-temporal reasoning over multimodal contexts. Achieving
human-like scene understanding in the AVQA task presents specific
challenges, including effectively fusing audio and visual information and
capturing question-relevant audio-visual features while maintaining temporal
synchronization. This paper proposes a Target-aware Joint Spatio-Temporal
Grounding Network for AVQA to address these challenges. The proposed approach
has two main components: the Target-aware Spatial Grounding module and the
Joint Audio-Visual Temporal Grounding module with its corresponding Tri-modal
consistency loss. The Target-aware module enables the model to focus on
audio-visual cues relevant to the inquiry subject by exploiting the explicit
semantics of the text modality. The Tri-modal consistency loss facilitates
interaction between audio and video during question-aware temporal grounding,
with fusion incorporated within a simpler single-stream architecture.
Experimental results on the MUSIC-AVQA dataset demonstrate that the proposed
method outperforms existing state-of-the-art methods.
Our code will be available soon.

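This work proposes a target-aware joint spatio-temporal grounding network for the audio-visual question answering (AVQA) task. It uses a tri-modal consistency loss to achieve question-aware spatio-temporal grounding, strengthens audio-visual interaction, and performs fusion within a single-stream architecture. Experimental results on the MUSIC-AVQA dataset demonstrate the superiority and effectiveness of the method.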