Video question answering (VideoQA) aims to answer a question about video content by reasoning over the semantic alignment between the video and the question. However, because they rely heavily on human instructions, i.e., annotations or priors, current contrastive learning-based VideoQA methods still struggle to perform fine-grained visual-linguistic alignment. In this work, we turn to game theory, which can model complicated relationships among multiple players under specific interaction strategies; here, the video, question, and answer act as ternary players, with the goal of achieving fine-grained alignment for the VideoQA task. Specifically, we design a VideoQA-specific interaction strategy tailored to the characteristics of VideoQA, which can mathematically generate fine-grained visual-linguistic alignment labels without labor-intensive annotation effort. Our TG-VQA outperforms the existing state of the art by a large margin (more than 5%) on both long-term and short-term VideoQA datasets, verifying its effectiveness and generalization ability. Thanks to the guidance of game-theoretic interaction, our model converges well on limited data ($10^4$ videos), surpassing most models pre-trained on large-scale data ($10^7$ videos).

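This work uses game-theoretic interaction strategies to achieve fine-grained visual-semantic alignment for video question answering without extensive annotation. Compared with existing methods, it improves results substantially on long-term and short-term VideoQA datasets, generalizes well, and converges well on limited data.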

TG-VQA: Ternary Game of Video Question Answering