Recent video question answering benchmarks indicate that state-of-the-art models struggle to answer compositional questions. However, it remains unclear which types of compositional reasoning cause models to mispredict. Furthermore, it is difficult to discern whether models arrive at answers using compositional reasoning or by leveraging data biases. In this paper, we develop a question decomposition engine that programmatically deconstructs a compositional question into a directed acyclic graph of sub-questions. The graph is designed such that each parent question is a composition of its children. We present AGQA-Decomp, a benchmark containing $2.3M$ question graphs, with an average of $11.49$ sub-questions per graph, and $4.55M$ total new sub-questions. Using question graphs, we evaluate three state-of-the-art models with a suite of novel compositional consistency metrics. We find that models either cannot reason correctly through most compositions or are reliant on incorrect reasoning to reach answers, frequently contradicting themselves or achieving high accuracies when failing at intermediate reasoning steps.
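To make the graph structure concrete, the following is a minimal sketch of a question graph in which each parent lists the sub-questions it is composed of, together with one simple check for "correct answer, failed intermediate step". It is an illustration under stated assumptions, not the benchmark's actual engine or metrics: the names `QuestionNode`, `topological_order`, and `right_for_wrong_reasons` are hypothetical, and the toy questions and predictions are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class QuestionNode:
    """One node of a hypothetical question graph (names are illustrative)."""
    qid: str
    text: str
    gold: str                                          # ground-truth answer
    children: list[str] = field(default_factory=list)  # ids of sub-questions


def topological_order(graph: dict[str, QuestionNode]) -> list[str]:
    """Return node ids with children before parents; raise on a cycle."""
    order: list[str] = []
    state: dict[str, int] = {}  # 1 = currently visiting, 2 = finished

    def visit(qid: str) -> None:
        if state.get(qid) == 2:
            return
        if state.get(qid) == 1:
            raise ValueError(f"cycle through {qid!r}: not a DAG")
        state[qid] = 1
        for child in graph[qid].children:
            visit(child)
        state[qid] = 2
        order.append(qid)

    for qid in graph:
        visit(qid)
    return order


def right_for_wrong_reasons(graph: dict[str, QuestionNode],
                            predictions: dict[str, str]) -> list[str]:
    """Ids of parents answered correctly while at least one child is wrong."""
    correct = {qid: predictions.get(qid) == node.gold
               for qid, node in graph.items()}
    return [
        qid for qid in topological_order(graph)
        if graph[qid].children
        and correct[qid]
        and not all(correct[c] for c in graph[qid].children)
    ]


# Toy graph (invented): "c" is composed of "b", which is composed of "a".
graph = {
    "a": QuestionNode("a", "Does a dish exist?", "yes"),
    "b": QuestionNode("b", "Is the person holding a dish?", "yes", ["a"]),
    "c": QuestionNode("c", "Is the person holding a dish before opening a door?",
                      "no", ["b"]),
}

# Invented model outputs: the model gets "c" right but fails its child "b".
predictions = {"a": "yes", "b": "no", "c": "no"}

flagged = right_for_wrong_reasons(graph, predictions)
print(flagged)
```

Here the model answers the top-level question correctly while missing the sub-question it depends on, so the parent is flagged. A per-question accuracy score alone would not surface this case, which is the motivation for evaluating over the whole graph.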

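This paper develops a question decomposition engine that deconstructs a compositional question into a directed acyclic graph of sub-questions. Using these question graphs, we evaluate three state-of-the-art models with a suite of novel compositional consistency metrics. We find that the models either cannot reason correctly through most compositions or rely on incorrect reasoning to reach answers, frequently contradicting themselves or achieving high accuracy while failing at intermediate reasoning steps.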

Measuring Compositional Consistency for Video Question Answering