Understanding human actions in wild videos is an important task with a broad
range of applications. In this paper we propose a novel approach named
Hierarchical Attention Network (HAN), which incorporates static
spatial information, short-term motion information, and long-term video temporal
structures for complex human action understanding. Compared to recent
convolutional neural network based approaches, HAN has the following advantages:
(1) HAN can efficiently capture video temporal structures over a longer range;
(2) HAN reveals temporal transitions between frame chunks with different
time steps, i.e., it explicitly models the temporal transitions between frames
as well as between video segments; and (3) with a multi-step spatio-temporal
attention mechanism, HAN automatically learns the important regions in video
frames and the important temporal segments in the video. The proposed model is
trained and evaluated on the standard video action benchmarks, i.e., UCF-101 and
HMDB-51, and it significantly outperforms the state of the art.

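This paper proposes the Hierarchical Attention Network (HAN) for complex human
action understanding. The model jointly incorporates the static spatial
information, short-term motion information, and long-term temporal structure of
videos, and uses a multi-step spatio-temporal attention mechanism to
automatically learn the important regions in video frames and the important
temporal segments. It significantly outperforms existing methods on standard
video action benchmarks.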
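
To make the idea of attending first over frames and then over video segments more concrete, the following is a minimal sketch of two-level soft attention pooling. It is not the HAN architecture itself: the dot-product scoring, the fixed query vectors, and the names `attention_pool`, `hierarchical_pool`, `chunk_size`, `frame_query`, and `segment_query` are all assumptions introduced for illustration, and a real model would learn the queries and operate on CNN features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features, query):
    """Soft attention: weight each feature vector by its dot-product
    score against `query`, then return the weighted sum and the weights.
    (Dot-product scoring is an illustrative assumption.)"""
    scores = [sum(f_i * q_i for f_i, q_i in zip(f, query)) for f in features]
    weights = softmax(scores)
    dim = len(features[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return pooled, weights


def hierarchical_pool(frame_features, chunk_size, frame_query, segment_query):
    """Two-level pooling: attend over frames inside each chunk to get one
    vector per segment, then attend over the segment vectors to get a
    single video-level vector (hypothetical helper, not from the paper)."""
    segments = []
    for start in range(0, len(frame_features), chunk_size):
        chunk = frame_features[start:start + chunk_size]
        pooled, _ = attention_pool(chunk, frame_query)
        segments.append(pooled)
    video_vector, segment_weights = attention_pool(segments, segment_query)
    return video_vector, segment_weights


# Toy usage: four 2-D "frame features" grouped into two segments.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
video_vector, segment_weights = hierarchical_pool(
    frames, chunk_size=2, frame_query=[1.0, 0.0], segment_query=[0.0, 1.0]
)
```

The segment-level weights sum to one and indicate how much each segment contributes to the video-level representation, which is the sense in which an attention mechanism can highlight important temporal segments.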