We study the problem of dynamic visual reasoning on raw videos. This is a challenging problem: current state-of-the-art models often require dense supervision on physical object properties and events from simulation, which is impractical to obtain in real life. In this paper, we present the Dynamic Concept Learner (DCL), a unified framework that grounds physical objects and events from video and language. DCL first adopts a trajectory extractor to track each object over time and to represent it as a latent, object-centric feature vector. Building upon this object-centric representation, DCL learns to approximate the dynamic interaction among objects using graph networks. DCL further incorporates a semantic parser to parse questions into semantic programs and, finally, a program executor to run the program to answer the question, leveraging the learned dynamics model. After training, DCL can detect and associate objects across frames, ground visual properties and physical events, understand the causal relationship between events, make future and counterfactual predictions, and leverage these extracted representations for answering queries. DCL achieves state-of-the-art performance on CLEVRER, a challenging causal video reasoning dataset, even without using ground-truth attributes and collision labels from simulations for training. We further test DCL on a newly proposed video-retrieval and event localization dataset derived from CLEVRER, showing its strong generalization capacity.

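This paper proposes a unified framework, the Dynamic Concept Learner (DCL), for modeling physical objects and events from video and natural-language text. DCL uses a trajectory extractor to track each object over time and represent it as a latent, object-centric feature vector, then feeds the objects into a graph network to learn the dynamic interactions among them. Finally, a semantic parser parses the question and a program executor runs the resulting program to answer it. The method achieves state-of-the-art performance on the CLEVRER dataset.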
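To make the final stage of the pipeline more concrete, the following is a minimal sketch of how a parsed question program might be executed over per-object records. It is an illustration under stated assumptions, not DCL's implementation: the `Obj` and `Collision` records, the operator names (`filter_color`, `filter_collide`, `count`, `exist`), and the `execute` function are all hypothetical, and the sketch uses hand-written symbolic labels where DCL grounds concepts from learned latent features without ground-truth attribute or collision labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    """Toy stand-in for one tracked object (hypothetical fields)."""
    oid: int
    color: str
    shape: str


@dataclass(frozen=True)
class Collision:
    """Toy collision event between two object ids at a frame index."""
    a: int
    b: int
    frame: int


def execute(program, objects, collisions):
    """Run a linear program: each op transforms the current value in turn."""
    state = list(objects)
    for op, *args in program:
        if op == "filter_color":
            state = [o for o in state if o.color == args[0]]
        elif op == "filter_shape":
            state = [o for o in state if o.shape == args[0]]
        elif op == "filter_collide":
            # Keep only objects that take part in at least one collision.
            hit = {c.a for c in collisions} | {c.b for c in collisions}
            state = [o for o in state if o.oid in hit]
        elif op == "count":
            state = len(state)
        elif op == "exist":
            state = len(state) > 0
        else:
            raise ValueError(f"unknown op: {op}")
    return state


# "How many red objects are involved in a collision?" as a hand-written program.
objects = [Obj(0, "red", "cube"), Obj(1, "blue", "sphere"), Obj(2, "red", "sphere")]
collisions = [Collision(0, 1, frame=12)]
program = [("filter_color", "red"), ("filter_collide",), ("count",)]
answer = execute(program, objects, collisions)
```

In this toy setup the program is written by hand; in the framework described above, a semantic parser would produce it from the question, and the object and event information would come from the trajectory extractor and the learned dynamics model rather than from fixed labels.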

Grounding Physical Concepts of Objects and Events Through Dynamic Visual Reasoning