Short-Term object-interaction Anticipation (STA) consists of detecting the location of the next-active objects, the noun and verb categories of the interaction, and the time to contact from the observation of egocentric video. We propose STAformer, a novel attention-based architecture integrating frame-guided temporal pooling, dual image-video attention, and multi-scale feature fusion to support STA predictions from an image-video input pair. Moreover, we introduce two novel modules to ground STA predictions on human behavior by modeling affordances. First, we integrate an environment affordance model which acts as a persistent memory of interactions that can take place in a given physical scene. Second, we predict interaction hotspots from the observation of hands and object trajectories, increasing confidence in STA predictions localized around the hotspot. On the test set, our method, trained on the v2 training dataset, obtains 33.5 N mAP, 17.25 N+V mAP, 11.77 N+δ mAP and 6.75 Overall top-5 mAP.
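The hotspot-based step described above can be illustrated with a minimal sketch. The abstract does not specify how confidence is increased around the hotspot, so everything here is an assumption made for illustration: the `Detection` container, the choice to sample the hotspot map at the box centre, and the mixing weight `alpha` are hypothetical and are not taken from the STAformer implementation.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Detection:
    box: tuple      # (x1, y1, x2, y2) in pixel coordinates
    noun: str
    verb: str
    ttc: float      # predicted time to contact, in seconds
    score: float    # detector confidence in [0, 1]


def hotspot_at_center(hotspot, box):
    """Return the hotspot probability at the box centre, clamped to the map."""
    x1, y1, x2, y2 = box
    row = min(max(int((y1 + y2) / 2), 0), len(hotspot) - 1)
    col = min(max(int((x1 + x2) / 2), 0), len(hotspot[0]) - 1)
    return hotspot[row][col]


def rescore(detections, hotspot, alpha=0.5):
    """Blend each score with the hotspot probability at its box centre.

    alpha is a hypothetical mixing weight: 0 keeps the original scores,
    1 multiplies them fully by the hotspot probability.
    """
    rescored = []
    for det in detections:
        p = hotspot_at_center(hotspot, det.box)
        rescored.append(replace(det, score=det.score * ((1.0 - alpha) + alpha * p)))
    # Highest confidence first, as a top-k metric would consume them.
    return sorted(rescored, key=lambda d: d.score, reverse=True)


# Toy 4x4 hotspot map with all probability mass at row 1, column 1.
hotspot = [[0.0] * 4 for _ in range(4)]
hotspot[1][1] = 1.0

detections = [
    Detection(box=(2, 2, 4, 4), noun="knife", verb="take", ttc=1.2, score=0.8),
    Detection(box=(0, 0, 3, 3), noun="cup", verb="take", ttc=0.7, score=0.6),
]
ranked = rescore(detections, hotspot)
```

In this toy case the lower-scored "cup" detection lies on the hotspot and keeps its score, while the "knife" detection lies off it and is down-weighted, so the ranking flips. A multiplicative blend is only one possible design; it leaves scores unchanged when `alpha` is 0, which makes the effect of the hotspot easy to ablate.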

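The STAformer model combines an attention-based architecture with temporal pooling, image-video attention, and multi-scale feature fusion to predict, from an image-video input pair, the location of short-term object interactions, their noun and verb categories, and the time to contact in the observed egocentric video. In addition, two new modules ground STA predictions by modeling affordances, including an interaction hotspot predictor based on observed hands and object trajectories, which increases confidence in STA predictions around the hotspot.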

ZARRIO @ Ego4D Short-Term Object Interaction Anticipation Challenge: Leveraging Affordances and Attention-Based Models for STA