Transformer architectures have become the model of choice in natural language processing and are now being introduced into computer vision tasks such as image classification, object detection, and semantic segmentation. However, in the field of human pose estimation, convolutional architectures remain dominant. In this work, we present PoseFormer, a purely transformer-based approach for 3D human pose estimation in videos that involves no convolutional architectures. Inspired by recent developments in vision transformers, we design a spatial-temporal transformer structure that comprehensively models the human joint relations within each frame as well as the temporal correlations across frames, and then outputs an accurate 3D human pose for the center frame. We quantitatively and qualitatively evaluate our method on two popular, standard benchmark datasets: Human3.6M and MPI-INF-3DHP. Extensive experiments show that PoseFormer achieves state-of-the-art performance on both datasets. Code is available at \url{https://github.com/zczcwh/PoseFormer}.

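This work proposes a transformer-based algorithm for 3D human pose estimation in videos. It applies transformers along the spatial and temporal dimensions to model the human joint relations within each frame, and outputs an accurate 3D human pose for the center frame. The algorithm achieves state-of-the-art results on the Human3.6M and MPI-INF-3DHP datasets.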
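
The spatial-then-temporal data flow described above can be illustrated with a small, untrained sketch. This is not the authors' implementation (that lives in the linked repository). It assumes the input is a short clip of 2D joint coordinates per frame, which the text above does not spell out, and it makes several simplifications: attention is single-head with no learned query/key/value projections, and positional embeddings, layer normalization, and feed-forward blocks are omitted. All names (`self_attention`, `linear`, `random_matrix`, `lift_center_frame`, `dim`, `seed`) are hypothetical, and the weights are random, so only the shapes and the order of operations are meaningful.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention without learned projections (a simplification)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out


def linear(vec, weight):
    """Multiply a length-n vector by an n x m weight matrix."""
    return [sum(v * row[j] for v, row in zip(vec, weight)) for j in range(len(weight[0]))]


def random_matrix(rows, cols, rng):
    """Untrained placeholder weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def lift_center_frame(poses_2d, dim=8, seed=0):
    """Map a clip (frames x joints x 2) to a center-frame pose (joints x 3)."""
    rng = random.Random(seed)
    num_frames, num_joints = len(poses_2d), len(poses_2d[0])
    embed = random_matrix(2, dim, rng)
    head = random_matrix(num_joints * dim, num_joints * 3, rng)

    frame_tokens = []
    for frame in poses_2d:
        joint_tokens = [linear(joint, embed) for joint in frame]
        # Spatial stage: joints attend to the other joints of the same frame.
        joint_tokens = self_attention(joint_tokens)
        # Flatten the joint tokens into a single token for this frame.
        frame_tokens.append([x for tok in joint_tokens for x in tok])

    # Temporal stage: frames attend to the other frames of the clip.
    frame_tokens = self_attention(frame_tokens)

    # Assumption: read the prediction off the center frame's token.
    center = frame_tokens[num_frames // 2]
    flat = linear(center, head)
    return [flat[3 * j: 3 * j + 3] for j in range(num_joints)]


rng = random.Random(1)
clip = [[[rng.random(), rng.random()] for _ in range(17)] for _ in range(9)]
pose_3d = lift_center_frame(clip)
print(len(pose_3d), len(pose_3d[0]))  # 17 3
```

The point of the two-stage layout is that the spatial stage only ever sees the joints of one frame, while the temporal stage only sees one token per frame, so the two kinds of correlation are modeled separately before the center-frame pose is regressed.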

3D Human Pose Estimation with Spatial and Temporal Transformers