Video question answering requires models to understand and reason about both complex video and language data to correctly derive answers. Existing efforts focus on designing sophisticated cross-modal interactions to fuse the information from two modalities, while encoding the video and question holistically as frame and word sequences. Despite their success, these methods are essentially revolving around the sequential nature of video- and question-contents, providing little insight to the problem of question-answering and lacking interpretability as well. In this work, we argue that while video is presented in frame sequence, the visual elements (eg, objects, actions, activities and events) are not sequential but rather hierarchical in semantic space. To align with the multi-granular essence of linguistic concepts in language queries, we propose to model video as a conditional graph hierarchy which weaves together visual facts of different granularity in a level-wise manner, with the guidance of corresponding textual cues. Despite the simplicity, our extensive experiments demonstrate the superiority of such conditional hierarchical graph architecture, with clear performance improvements over prior methods and also better generalization across different type of questions. Further analyses also consolidate the model's reliability as it shows meaningful visual-textual evidences for the predicted answers.

本文提出了一种将视频建模为条件分层图层次结构的方法，通过组合不同层次的视觉元素来对齐语言查询中的多粒度语义概念，该方法超越了先前方法的表现，且对于不同类型的问题也具有更好的泛化能力。

利用视频作为条件图层级的多粒度问答