Visual events are a composition of temporal actions involving actors spatially interacting with objects. When developing computer vision models that can reason about compositional spatio-temporal events, we need benchmarks that can analyze progress and uncover shortcomings. Existing video question answering benchmarks are useful, but they often conflate multiple sources of error into one accuracy metric and have strong biases that models can exploit, making it difficult to pinpoint model weaknesses. We present Action Genome Question Answering (AGQA), a new benchmark for compositional spatio-temporal reasoning. AGQA contains $192M$ unbalanced question-answer pairs for $9.6K$ videos. We also provide a balanced subset of $3.9M$ question-answer pairs, $3$ orders of magnitude larger than existing benchmarks, that minimizes bias by balancing the answer distributions and types of question structures. Although human evaluators marked $86.02\%$ of our question-answer pairs as correct, the best model achieves only $47.74\%$ accuracy. In addition, AGQA introduces multiple training/test splits to test for various reasoning abilities, including generalization to novel compositions, to indirect references, and to more compositional steps. Using AGQA, we evaluate modern visual reasoning systems, demonstrating that the best models barely perform better than non-visual baselines exploiting linguistic biases and that none of the existing models generalize to novel compositions unseen during training.

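This paper introduces a new benchmark for computer vision models, Action Genome Question Answering (AGQA), and provides a balanced subset of 3.9M question-answer pairs to minimize bias. AGQA introduces multiple training/test splits that test various reasoning abilities, including generalization to novel compositions, to indirect references, and to more compositional steps. The study finds that the best models perform only slightly better than non-visual baselines that exploit linguistic biases, and that none of the existing models generalize to novel compositions unseen during training.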
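
The balanced subset is described as minimizing bias by balancing the answer distributions. As a rough illustration of that idea only, the sketch below downsamples question-answer pairs so that every answer within a question group appears equally often. It is not the AGQA balancing procedure, which also balances question structures; the function name `balance_answers`, the field names `group` and `answer`, and the uniform-downsampling strategy are all assumptions made for this example.

```python
import random
from collections import Counter, defaultdict


def balance_answers(qa_pairs, seed=0):
    """Downsample so each answer in a question group is equally frequent.

    Hypothetical helper: 'group' and 'answer' are assumed field names,
    not the AGQA data format.
    """
    rng = random.Random(seed)

    # Bucket the pairs by question group, then by answer.
    by_group = defaultdict(lambda: defaultdict(list))
    for qa in qa_pairs:
        by_group[qa["group"]][qa["answer"]].append(qa)

    balanced = []
    for answers in by_group.values():
        # Cap every answer at the count of the rarest answer in the group.
        cap = min(len(items) for items in answers.values())
        for items in answers.values():
            rng.shuffle(items)
            balanced.extend(items[:cap])
    return balanced


# Toy data: a binary question group skewed 6:2 toward "yes".
toy = [{"group": "exists", "answer": "yes"} for _ in range(6)]
toy += [{"group": "exists", "answer": "no"} for _ in range(2)]

balanced = balance_answers(toy)
counts = Counter(qa["answer"] for qa in balanced)
print(counts["yes"], counts["no"])
```

With a skewed group like the toy one, a model that always answers "yes" scores 75% before balancing and 50% after, which is the kind of language-only shortcut a balanced benchmark is meant to remove.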

AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning