We propose a new spatio-temporal attention-based mechanism for human action recognition that automatically attends to the hands most involved in the studied action and detects the most discriminative moments in an action. Attention is handled in a recurrent manner using a recurrent neural network (RNN) and is fully differentiable. In contrast to standard soft-attention mechanisms, our approach does not use the hidden RNN state as input to the attention model. Instead, attention distributions are extracted from external information: the articulated human pose. We perform an extensive ablation study to show the strengths of this approach, with particular focus on the conditioning aspect of the attention mechanism. We evaluate the method on the largest currently available human action recognition dataset, NTU-RGB+D, and report state-of-the-art results. A further advantage of our model is a degree of explainability: the spatial and temporal attention distributions at test time make it possible to study and verify which parts of the input data the method focuses on.
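
The contrast with standard soft attention can be made concrete with a minimal NumPy sketch. It is an illustration under stated assumptions, not the architecture evaluated in this work: the function name `pose_conditioned_attention`, the linear scoring maps `W_s` and `w_t`, the plain tanh RNN cell, and all dimensions are hypothetical. The only property it is meant to show is that both attention distributions are computed from pose features alone, so the hidden RNN state never enters the attention model.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def pose_conditioned_attention(pose, hand_feats, W_s, w_t, W_x, W_h):
    """Hypothetical sketch of pose-conditioned spatio-temporal attention.

    pose:       (T, P)    pose features per frame
    hand_feats: (T, H, D) appearance features for H hand crops per frame
    W_s: (P, H) assumed linear map from pose to spatial scores
    w_t: (P,)   assumed linear map from pose to a temporal score
    W_x: (K, D), W_h: (K, K) weights of a plain tanh RNN cell (assumed)
    """
    T = pose.shape[0]

    # Spatial attention over hand crops, computed from pose only.
    spatial = softmax(pose @ W_s, axis=1)            # (T, H)

    # Recurrent pass over the attended hand features.
    h = np.zeros(W_h.shape[0])
    hidden = []
    for t in range(T):
        x_t = spatial[t] @ hand_feats[t]             # (D,) weighted sum of crops
        h = np.tanh(W_x @ x_t + W_h @ h)             # h is not fed to attention
        hidden.append(h)
    hidden = np.stack(hidden)                        # (T, K)

    # Temporal attention over time steps, again computed from pose only.
    temporal = softmax(pose @ w_t, axis=0)           # (T,)
    clip_feat = temporal @ hidden                    # (K,)
    return clip_feat, spatial, temporal


# Usage with random data; all sizes are arbitrary.
rng = np.random.default_rng(0)
T, P, H, D, K = 6, 10, 4, 8, 5
pose = rng.normal(size=(T, P))
hand_feats = rng.normal(size=(T, H, D))
W_s, w_t = rng.normal(size=(P, H)), rng.normal(size=P)
W_x, W_h = rng.normal(size=(K, D)), rng.normal(size=(K, K))
clip_feat, spatial, temporal = pose_conditioned_attention(
    pose, hand_feats, W_s, w_t, W_x, W_h
)
```

Because `spatial` and `temporal` depend only on `pose`, they can be inspected at test time independently of the recurrent state, which is the kind of inspection the explainability remark refers to.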

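This work proposes a human action recognition method based on a spatio-temporal attention mechanism. Attention distributions are extracted from external information (human pose), and attention is processed recurrently with an RNN, so that the model automatically attends to the hands most active in the action and detects the most discriminative elements of the action. The method achieves state-of-the-art results on the NTU-RGB+D dataset.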

Human Action Recognition with Pose-Based Attention on Hands