Audio-Visual Question Answering (AVQA) is a complex multi-modal reasoning
task that requires intelligent systems to answer natural language
queries accurately based on audio-video input pairs. However, prevalent AVQA
approaches are prone to overfitting to dataset biases, resulting in poor
robustness. Furthermore, current datasets may not provide a precise diagnostic
for these methods. To tackle these challenges, we first propose a novel
dataset, \textit{MUSIC-AVQA-R}, crafted in two steps: rephrasing questions
within the test split of a public dataset (\textit{MUSIC-AVQA}) and
subsequently introducing distribution shifts to partition the questions. The former
leads to a large, diverse test space, while the latter results in a
comprehensive robustness evaluation on rare, frequent, and overall questions.
Secondly, we propose a robust architecture that utilizes a multifaceted cycle
collaborative debiasing strategy to overcome bias learning. Experimental
results show that this architecture achieves state-of-the-art performance on
both datasets, especially obtaining a significant improvement of 9.68\% on the
proposed dataset. Extensive ablation experiments are conducted on these two
datasets to validate the effectiveness of the debiasing strategy. Additionally,
we highlight the limited robustness of existing multi-modal QA methods by
evaluating them on our dataset.

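Audio-Visual Question Answering (AVQA) is a complex multi-modal reasoning task that requires intelligent systems to answer natural language queries accurately based on audio-video input pairs. However, existing AVQA methods tend to overfit to dataset biases, resulting in poor robustness. We propose a new dataset (\textit{MUSIC-AVQA-R}) and a robust architecture that overcomes bias learning through a multifaceted cycle collaborative debiasing strategy. The results show that this architecture achieves state-of-the-art performance on both datasets, with an improvement of 9.68\% on our proposed dataset in particular. Evaluation on our dataset also highlights the limited robustness of existing multi-modal QA methods.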