Spatio-temporal scene graphs represent interactions in a video by decomposing
scenes into individual objects and their pair-wise temporal relationships.
Long-term anticipation of the fine-grained pair-wise relationships between
objects is a challenging problem. To this end, we introduce the task of Scene
Graph Anticipation (SGA). We adapt state-of-the-art scene graph generation
methods as baselines to anticipate future pair-wise relationships between
objects and propose a novel approach, SceneSayer. In SceneSayer, we leverage
object-centric representations of relationships to reason about the observed
video frames and model the evolution of relationships between objects. We take
a continuous-time perspective and model the latent dynamics of the evolution of
object interactions using the concepts of NeuralODE and NeuralSDE. We infer
representations of future relationships by solving an Ordinary Differential
Equation and a Stochastic Differential Equation, respectively.
Extensive experimentation on the Action Genome dataset validates the efficacy
of the proposed methods.
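
The continuous-time idea described above can be illustrated with a small sketch. This is not SceneSayer's implementation: it assumes a toy linear drift in place of a learned network, a constant noise scale `sigma`, and simple fixed-step Euler and Euler-Maruyama solvers. All names (`drift`, `euler_ode`, `euler_maruyama_sde`, `z_obs`) are hypothetical and chosen only for illustration.

```python
import math
import random

def drift(z):
    # Hypothetical stand-in for a learned drift network f(z):
    # a simple linear decay of each latent coordinate toward zero.
    return [-0.5 * zi for zi in z]

def euler_ode(z0, t_end, steps):
    """Integrate dz/dt = drift(z) from t=0 to t_end with fixed-step Euler."""
    dt = t_end / steps
    z = list(z0)
    for _ in range(steps):
        f = drift(z)
        z = [zi + dt * fi for zi, fi in zip(z, f)]
    return z

def euler_maruyama_sde(z0, t_end, steps, sigma=0.1, seed=0):
    """Integrate dz = drift(z) dt + sigma dW with the Euler-Maruyama scheme."""
    rng = random.Random(seed)  # seeded so the sampled path is reproducible
    dt = t_end / steps
    z = list(z0)
    for _ in range(steps):
        f = drift(z)
        z = [zi + dt * fi + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
             for zi, fi in zip(z, f)]
    return z

# Toy latent representation of one object pair at the last observed frame.
z_obs = [1.0, -2.0]

# Deterministic (ODE) and stochastic (SDE) estimates of the future latent state.
z_future_ode = euler_ode(z_obs, t_end=1.0, steps=1000)
z_future_sde = euler_maruyama_sde(z_obs, t_end=1.0, steps=1000)
```

In this sketch the ODE solution is a single deterministic trajectory, while the SDE solution is one sampled path; drawing several paths with different seeds would give a spread of plausible future states. In practice a model of this kind would likely use a learned drift (and diffusion) network and an adaptive solver rather than the fixed-step loops shown here.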

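Analyzes spatio-temporal scene graphs in video and proposes SceneSayer, which reasons over the observed video frames to model the evolution of relationships between objects, and uses neural ordinary differential equations and neural stochastic differential equations to predict future relationships between objects. Extensive experiments on the Action Genome dataset validate the effectiveness of the method.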

Towards Scene Graph Anticipation

The task of predicting future actions from a video is crucial for a
real-world agent interacting with others. When anticipating actions in the
distant future, we humans typically consider long-term relations over the whole
sequence of actions, i.e., not only observed actions in the past but also
potential actions in the future. In a similar spirit, we propose an end-to-end
attention model for action anticipation, dubbed Future Transformer (FUTR), that
leverages global attention over all input frames and output tokens to predict a
minutes-long sequence of future actions. Unlike previous autoregressive
models, the proposed method learns to predict the whole sequence of future
actions with parallel decoding, enabling more accurate and faster inference for
long-term anticipation. We evaluate our method on two standard benchmarks for
long-term action anticipation, Breakfast and 50 Salads, achieving
state-of-the-art results.
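
The contrast between parallel and autoregressive decoding described above can be sketched in a few lines. This is a toy illustration rather than the FUTR architecture: it assumes a single-head dot-product cross-attention with no learned projections, omits the attention among output tokens, and uses an argmax over feature dimensions as a stand-in for an action classifier. The names `cross_attention`, `decode_parallel`, `frame_features`, and `action_queries` are hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, memory):
    """Single-head dot-product attention: every query attends to every memory row."""
    d = len(memory[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * mi for qi, mi in zip(q, row)) / math.sqrt(d)
                  for row in memory]
        weights = softmax(scores)
        outputs.append([sum(w * row[j] for w, row in zip(weights, memory))
                        for j in range(d)])
    return outputs

def decode_parallel(action_queries, frame_features):
    """Predict one action index per query in a single pass.

    No prediction depends on an earlier prediction, so all future
    actions could be computed at the same time.
    """
    attended = cross_attention(action_queries, frame_features)
    # Toy readout: the index of the largest attended feature is the "action".
    return [max(range(len(v)), key=lambda j: v[j]) for v in attended]

# Three observed frames with one-hot toy features.
frame_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Two future-action queries, one aligned with frame 0 and one with frame 2.
action_queries = [[5.0, 0.0, 0.0], [0.0, 0.0, 5.0]]

predicted = decode_parallel(action_queries, frame_features)  # [0, 2]
```

An autoregressive decoder would instead produce the actions one at a time, feeding each prediction back in before computing the next, so its inference cost grows with the number of sequential steps.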

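Proposes a new attention-based action anticipation model, Future Transformer (FUTR), which learns global action information across the video to predict minutes-long sequences of future actions. Compared with conventional autoregressive models, FUTR performs long-term anticipation more accurately and faster. In experiments on two standard datasets, Breakfast and 50 Salads, FUTR achieves state-of-the-art results.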