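TL;DR: Using cross-modal alignment and representation fusion, the method achieves 82.06% accuracy on the Social IQ 2.0 dataset. It makes better use of the video modality and improves performance by mitigating problems that affect current techniques, such as language overfitting and video-modality bypassing.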
Abstract
Video-based question answering (Video QA) is a challenging task, and it becomes even more intricate for socially intelligent question answering (SIQA). SIQA requires context understanding, temporal reason