To improve the efficiency of pose sequence output in 3D human pose estimation, we aim to further improve prediction stability when only a few input video frames are available. Many previous methods lack an understanding of local joint information. \cite{9878888} considers the temporal relationship of a single joint. However, we find that the trajectories of different joints are also predictively correlated over time. Our proposed \textbf{Fusionformer} method therefore adds a self-trajectory module and a cross-trajectory module to the spatio-temporal module. The global spatio-temporal features and the local joint trajectory features are then fused in parallel through a linear network. Finally, to reduce the influence of poor 2D poses on the 3D projections, we introduce a pose refinement network that balances the consistency of the 3D projections. We evaluate the proposed method on two benchmark datasets, Human3.6M and MPI-INF-3DHP. Compared with the baseline PoseFormer, our method improves MPJPE by 2.4\% and P-MPJPE by 4.3\% on Human3.6M.
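
For reference, the two metrics reported above can be written as follows. This is a sketch of the standard definitions, not notation taken from this work: the symbols $T$ (number of frames), $J$ (number of joints), $\hat{\mathbf{p}}_{t,j}$ (predicted 3D position of joint $j$ in frame $t$) and $\mathbf{p}_{t,j}$ (the corresponding ground truth) are assumptions introduced here for illustration.

\begin{equation}
\mathrm{MPJPE} = \frac{1}{TJ}\sum_{t=1}^{T}\sum_{j=1}^{J}\left\| \hat{\mathbf{p}}_{t,j} - \mathbf{p}_{t,j} \right\|_2
\end{equation}

P-MPJPE is the same quantity computed after each predicted pose has been rigidly aligned to the ground truth (translation, rotation and scale, via Procrustes analysis). Both are errors, so lower values are better.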

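We propose Fusionformer, a method for 3D human pose estimation that introduces a self-trajectory module and a cross-trajectory module, fuses global spatio-temporal features with local joint trajectory features, and finally applies a pose refinement network to balance the consistency of the 3D projections. Evaluated on two benchmark datasets, it improves MPJPE by 2.4\% and P-MPJPE by 4.3\% on Human3.6M compared with the baseline PoseFormer.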

Exploring joint motion synergy with a Transformer-based fusion network for 3D human pose estimation