Audio-visual question answering (AVQA) is a challenging task that requires
multistep spatio-temporal reasoning over multimodal contexts. Achieving
human-like scene understanding in AVQA presents specific
challenges, including effectively fusing audio and visual information and
capturing question-relevant audio-visual features while maintaining temporal
synchronization. This paper proposes a Target-aware Joint Spatio-Temporal
Grounding Network for AVQA to address these challenges. The proposed approach
has two main components: (1) the Target-aware Spatial Grounding module and
(2) the Tri-modal consistency loss with its corresponding Joint audio-visual
temporal grounding module. The Target-aware module enables the model to focus on
audio-visual cues relevant to the inquiry subject by exploiting the explicit
semantics of text modality. The Tri-modal consistency loss facilitates the
interaction between audio and video during question-aware temporal grounding
and incorporates fusion within a simpler single-stream architecture.
Experimental results on the MUSIC-AVQA dataset show that the proposed method
outperforms existing state-of-the-art methods.
Our code will be available soon.
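
To make the idea of focusing on question-relevant cues concrete, here is a minimal sketch of question-guided attention pooling. It is not the paper's implementation: the function names (`softmax`, `dot`, `attend`), the scaled dot-product scoring, and the toy vectors are all assumptions chosen for illustration. The same pooling could in principle be applied over spatial regions or over temporal segments.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attend(query, features):
    """Pool `features` with weights given by their similarity to `query`.

    Hypothetical helper: scaled dot-product attention is assumed here,
    it is not taken from the paper.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, f) / scale for f in features])
    dim = len(features[0])
    pooled = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return weights, pooled


# Toy data (made up): a question embedding and three region features.
question_vec = [1.0, 0.0]
region_feats = [[0.1, 0.9], [2.0, 0.1], [0.0, 0.0]]

weights, pooled = attend(question_vec, region_feats)
print(weights)  # the second region receives the largest weight
print(pooled)   # weighted average of the region features
```

In this toy setting the region whose feature aligns best with the question embedding dominates the pooled representation, which is the intuition behind using text semantics to select audio-visual cues.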

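This work proposes a target-aware joint spatio-temporal grounding network for
the audio-visual question answering (AVQA) task. It uses a tri-modal
consistency loss to achieve question-aware spatio-temporal grounding,
strengthens audio-visual interaction, and performs fusion within a
single-stream architecture. Experimental results on the MUSIC-AVQA dataset
demonstrate the effectiveness and superiority of the method.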

Target-Aware Spatio-Temporal Reasoning via Answering Questions in Dynamic Audio-Visual Scenarios

In this paper, we focus on the Audio-Visual Question Answering (AVQA) task,
which aims to answer questions regarding different visual objects, sounds, and
their associations in videos. The problem requires comprehensive multimodal
understanding and spatio-temporal reasoning over audio-visual scenes. To
benchmark this task and facilitate our study, we introduce a large-scale
MUSIC-AVQA dataset, which contains more than 45K question-answer pairs covering
33 different question templates spanning different modalities and question
types. We develop several baselines and introduce a spatio-temporal grounded
audio-visual network for the AVQA problem. Our results demonstrate that AVQA
benefits from multisensory perception and our model outperforms recent A-, V-,
and AVQA approaches. We believe that our dataset has the potential to
serve as a testbed for evaluating and promoting progress in audio-visual scene
understanding and spatio-temporal reasoning. Code and dataset:
this http URL

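This paper studies the Audio-Visual Question Answering (AVQA) task. It
introduces the MUSIC-AVQA dataset, which contains more than 45K
question-answer pairs, and addresses the task using multimodal understanding
and spatio-temporal reasoning over audio-visual scenes. The results show that
the method outperforms existing A-, V-, and AVQA approaches.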