In Audio-Visual Question Answering (AVQA) tasks, the audio and
visual modalities can be learned on three levels: 1) Spatial, 2) Temporal, and
3) Semantic. Existing AVQA methods suffer from two major shortcomings: the
audio-visual (AV) information passing through the network is not aligned on
the Spatial and Temporal levels, and inter-modal (audio and visual) Semantic
information is often not balanced within a context. Both result in poor
performance. In this paper, we propose a novel end-to-end Contextual
Multi-modal Alignment (CAD) network that addresses the challenges in AVQA
methods by i) introducing a parameter-free stochastic Contextual block that
ensures robust audio and visual alignment on the Spatial level; ii) proposing a
pre-training technique for dynamic audio and visual alignment on the Temporal
level in a self-supervised setting; and iii) introducing a cross-attention
mechanism to balance audio and visual information on the Semantic level. The
proposed CAD network improves overall performance over state-of-the-art methods
by 9.4% on average on the MUSIC-AVQA dataset. We also demonstrate that our
proposed contributions to AVQA can be added to existing methods to improve
their performance without additional complexity requirements.
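
To make the cross-attention idea in contribution iii) concrete, the following is a minimal sketch of generic scaled dot-product cross-attention between two feature sequences. It is not the CAD network's actual Semantic-level module: the function names, the toy feature values, and the absence of learned projection weights are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys.

    Hypothetical helper; no learned projections are applied here.
    """
    d = len(keys[0])
    dim_out = len(values[0])
    attended = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        attended.append([sum(w * v[j] for w, v in zip(weights, values))
                         for j in range(dim_out)])
    return attended


# Toy features (assumed values): 2 audio segments, 3 visual segments, dim 2.
audio = [[1.0, 0.0], [0.0, 1.0]]
visual = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]

# Audio queries attend over visual features, and vice versa.
audio_attended = cross_attention(audio, visual, visual)
visual_attended = cross_attention(visual, audio, audio)
```

In this sketch each modality is re-expressed as a weighted mixture of the other modality's features, which is one common way to let information flow in both directions between audio and visual streams.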
