360$^\circ$ videos convey holistic views for the surroundings of a scene. It provides audio-visual cues beyond pre-determined normal field of views and displays distinctive spatial relations on a sphere. However, previous benchmark tasks for panoramic videos are still limited to evaluate the semantic understanding of audio-visual relationships or spherical spatial property in surroundings. We propose a novel benchmark named Pano-AVQA as a large-scale grounded audio-visual question answering dataset on panoramic videos. Using 5.4K 360$^\circ$ video clips harvested online, we collect two types of novel question-answer pairs with bounding-box grounding: spherical spatial relation QAs and audio-visual relation QAs. We train several transformer-based models from Pano-AVQA, where the results suggest that our proposed spherical spatial embeddings and multimodal training objectives fairly contribute to a better semantic understanding of the panoramic surroundings on the dataset.

本文提出了Pano-AVQA基准测试用于评估全景视频中音频-视觉关系和球形空间关系的语义理解。使用在线获取的5.4K个视频剪辑，收集了两种类型的新型问题-答案对。通过球形空间嵌入和多模态训练目标，使用多个基于Transformer的模型从Pano-AVQA中进行训练，结果表明我们的提出的球形空间嵌入和多模态训练目标对数据集上全景环境的语义理解有很好的帮助。

Pano-AVQA: 360°视频上基于感知的音视问题回答