Skeleton-based gesture recognition methods have achieved strong results using Graph Convolutional Networks (GCNs). In addition, context-dependent adaptive topology, which encodes neighborhood vertex information, and attention mechanisms help a model better represent actions. In this paper, we propose a hybrid self-attention GCN model, the Multi-Scale Spatial-Temporal self-attention GCN (MSST-GCN), which improves modeling ability and achieves state-of-the-art results on several datasets. We use a spatial self-attention module with adaptive topology to capture intra-frame interactions among different body parts, and a temporal self-attention module to examine correlations across frames for each node. These two modules are followed by a multi-scale convolution network with dilations, which captures not only the long-range temporal dependencies of joints but also the long-range spatial dependencies (i.e., long-distance dependencies) of node temporal behaviors. The resulting features are combined into high-level spatial-temporal representations, and a softmax classifier outputs the predicted action.
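As a rough illustration of the two attention modules, the following NumPy sketch applies scaled dot-product self-attention first across the joints of each frame and then across the frames of each joint. It is not the authors' implementation: the tensor sizes, the random projection weights, the chain-shaped skeleton, and the way the adaptive topology enters (a skeleton adjacency plus a learnable offset added to the attention logits) are all assumptions made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v, bias=None):
    """Scaled dot-product self-attention over the second-to-last axis of x.

    x: (..., N, C); w_q, w_k, w_v: (C, D); bias: optional (N, N) added to logits.
    Returns the attended features (..., N, D) and the attention map (..., N, N).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    logits = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    if bias is not None:
        logits = logits + bias
    attn = softmax(logits, axis=-1)
    return attn @ v, attn

rng = np.random.default_rng(0)
T, V, C, D = 8, 5, 3, 4            # frames, joints, input/hidden channels (toy sizes)
x = rng.normal(size=(T, V, C))     # one skeleton sequence

# Assumed adaptive topology: fixed adjacency of a toy chain skeleton plus a
# learnable offset (zeros here; it would be trained in a real model).
A_skeleton = np.eye(V)
for i in range(V - 1):
    A_skeleton[i, i + 1] = A_skeleton[i + 1, i] = 1.0
B_learned = np.zeros((V, V))

# Spatial self-attention: joints attend to each other within each frame.
w_s = [rng.normal(size=(C, D)) * 0.1 for _ in range(3)]
spatial_out, spatial_attn = self_attention(x, *w_s, bias=A_skeleton + B_learned)

# Temporal self-attention: each joint attends over its own frames,
# so the joint axis is moved to the front and time becomes the token axis.
w_t = [rng.normal(size=(D, D)) * 0.1 for _ in range(3)]
per_joint = np.swapaxes(spatial_out, 0, 1)          # (V, T, D)
temporal_out, temporal_attn = self_attention(per_joint, *w_t)
temporal_out = np.swapaxes(temporal_out, 0, 1)      # back to (T, V, D)
```

Here `spatial_attn` has shape `(T, V, V)` (one joint-to-joint map per frame) and `temporal_attn` has shape `(V, T, T)` (one frame-to-frame map per joint), which mirrors the intra-frame and inter-frame roles described above.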

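Using self-attention graph convolutional network (GCN) techniques, this study proposes a hybrid model, the Multi-Scale Spatial-Temporal Self-Attention network (MSST-GCN), to effectively improve modeling ability, achieving state-of-the-art results on several datasets. The model uses a spatial self-attention module to understand the relationships among different body parts within a frame, and a temporal self-attention module to study the correlations between frames of a node. A multi-scale convolution network then captures the long-range spatial-temporal dependencies of the nodes, combines them into high-level spatial-temporal representations, and uses a softmax classifier to output the predicted action.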
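The multi-scale convolution stage and the softmax classifier can be sketched in the same spirit. The NumPy example below runs parallel temporal convolutions with different dilation rates, concatenates the branches, pools over frames and joints, and applies a linear layer with softmax. The dilation rates (1, 2, 4), kernel size 3, channel sizes, global average pooling, and all function names are hypothetical choices for illustration, not details taken from the paper.

```python
import numpy as np

def dilated_temporal_conv(x, kernel, dilation):
    """1D convolution along time with zero 'same' padding (odd kernel size).

    x: (T, V, C) features; kernel: (K, C, C_out); returns (T, V, C_out).
    """
    K = kernel.shape[0]
    pad = dilation * (K - 1) // 2
    xp = np.pad(x, ((pad, pad), (0, 0), (0, 0)))
    T = x.shape[0]
    out = np.zeros((T, x.shape[1], kernel.shape[2]))
    for k in range(K):
        # Tap k reads the input shifted by (k - (K - 1) / 2) * dilation frames.
        out += xp[k * dilation : k * dilation + T] @ kernel[k]
    return out

def multi_scale_block(x, kernels, dilations):
    """Run one dilated branch per scale and concatenate along channels."""
    branches = [dilated_temporal_conv(x, k, d) for k, d in zip(kernels, dilations)]
    return np.concatenate(branches, axis=-1)

def classify(features, w, b):
    """Global average pool over frames and joints, then linear + softmax."""
    pooled = features.mean(axis=(0, 1))
    logits = pooled @ w + b
    logits = logits - logits.max()
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(0)
T, V, C, C_out, num_classes = 16, 5, 4, 6, 10   # toy sizes (assumed)
dilations = (1, 2, 4)                           # assumed dilation rates
x = rng.normal(size=(T, V, C))                  # e.g. output of the attention modules

kernels = [rng.normal(size=(3, C, C_out)) * 0.1 for _ in dilations]
features = multi_scale_block(x, kernels, dilations)   # (T, V, 3 * C_out)

w = rng.normal(size=(features.shape[-1], num_classes)) * 0.1
b = np.zeros(num_classes)
probs = classify(features, w, b)
predicted_action = int(np.argmax(probs))
```

With kernel size 3, a branch with dilation `d` mixes frames `t - d`, `t`, and `t + d`, so larger dilation rates reach further along the time axis without adding parameters.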

Skeleton-Based Action Recognition Based on a Multi-Scale Spatial-Temporal Self-Attention Graph Convolutional Network