Skeleton sequences provide 3D trajectories of human skeleton joints, and the spatial-temporal information they carry is essential for action recognition. Because deep convolutional neural networks (CNNs) are very powerful for learning features from images, in this paper we propose to transform a skeleton sequence into an image-based representation so that spatial-temporal information can be learned with a CNN. Specifically, for each channel of the 3D coordinates, we transform the sequence into a clip of several gray images, each of which captures a different spatial structure of the joints. These images are fed to a deep CNN to learn high-level features. The CNN features of the three clips at the same time-step are concatenated into a feature vector. Each feature vector represents the temporal information of the entire skeleton sequence and one particular spatial relationship of the joints. We then propose a Multi-Task Learning Network (MTLN) to jointly process the feature vectors of all time-steps in parallel for action recognition. Experimental results show the effectiveness of the proposed representation and feature learning method for 3D action recognition.
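
As a rough illustration of the transformation described above, the following sketch turns one skeleton sequence into three clips of gray images, one clip per coordinate channel. The abstract does not specify how the images are built, so several details here are assumptions made only for illustration: the function name `skeleton_to_clips` is hypothetical, each image encodes joint positions relative to one reference joint, the reference joint indices are arbitrary, and values are min-max scaled to the range 0 to 255.

```python
import numpy as np


def skeleton_to_clips(seq, ref_joints):
    """Convert a skeleton sequence into gray-image clips (illustrative sketch).

    seq        : array of shape (T, J, 3), T frames, J joints, 3 coordinates.
    ref_joints : joint indices used as reference points (an assumption).
    Returns a uint8 array of shape (3, len(ref_joints), J, T):
    one clip per coordinate channel, one image per reference joint.
    """
    seq = np.asarray(seq, dtype=np.float64)
    T, J, C = seq.shape
    clips = np.empty((C, len(ref_joints), J, T), dtype=np.uint8)
    for c in range(C):
        for k, r in enumerate(ref_joints):
            # Position of every joint relative to reference joint r, per frame.
            rel = seq[:, :, c] - seq[:, r:r + 1, c]  # shape (T, J)
            img = rel.T  # rows are joints, columns are frames
            lo, hi = img.min(), img.max()
            scale = hi - lo if hi > lo else 1.0  # guard against a flat image
            # Assumed min-max scaling to 8-bit gray values.
            clips[c, k] = np.round(255.0 * (img - lo) / scale).astype(np.uint8)
    return clips


# Synthetic data standing in for a real skeleton sequence.
rng = np.random.default_rng(0)
seq = rng.normal(size=(40, 25, 3))  # 40 frames, 25 joints
clips = skeleton_to_clips(seq, ref_joints=[0, 4, 8, 12])
print(clips.shape)  # (3, 4, 25, 40)
```

Under these assumptions, each image spans every frame of the sequence along its columns, which is one way a single image could carry the temporal information of the whole sequence while encoding one spatial relationship between the joints. Images with the same index in the three clips would then be passed through the CNN and their features concatenated.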

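This paper presents a new method for 3D action recognition using skeleton sequences (i.e., 3D trajectories of human skeleton joints), in which deep neural networks are used to learn spatial-temporal features and long-term temporal information. Experimental results show that the method achieves good recognition performance.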

A New Representation of Skeleton Sequences for 3D Action Recognition