Recent transformer-based approaches have demonstrated excellent performance in 3D human pose estimation. However, because they take a holistic view and encode global relationships between all joints, they do not capture local dependencies precisely. In this paper, we present a novel Attention-GCNFormer (AGFormer) block that splits the channels between two parallel streams: a transformer stream and a GCNFormer stream. Our proposed GCNFormer module exploits the local relationships between adjacent joints and outputs a new representation that is complementary to the transformer output. By fusing these two representations adaptively, AGFormer learns the underlying 3D structure better. By stacking multiple AGFormer blocks, we propose MotionAGFormer in four variants, which can be chosen according to the speed-accuracy trade-off. We evaluate our model on two popular benchmark datasets: Human3.6M and MPI-INF-3DHP. MotionAGFormer-B achieves state-of-the-art results, with P1 errors of 38.4 mm and 16.2 mm, respectively. Remarkably, it uses a quarter of the parameters and is three times more computationally efficient than the previous leading model on the Human3.6M dataset. Code and models are available at https://github.com/TaatiTeam/MotionAGFormer.
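
The two-stream idea described in the abstract can be illustrated with a minimal sketch. It is not the released implementation: the function names, the projection-free attention, the row-normalized adjacency, and the softmax gate used for fusion are all assumptions made for illustration, since the abstract does not specify these details. A global stream lets every joint attend to all joints, a local stream averages each joint with its skeletal neighbours, and a per-joint gate blends the two outputs.

```python
import math

def normalize_adjacency(edges, num_joints):
    """Row-normalized skeleton adjacency with self-loops (illustrative choice)."""
    adj = [[1.0 if i == j else 0.0 for j in range(num_joints)]
           for i in range(num_joints)]
    for a, b in edges:
        adj[a][b] = 1.0
        adj[b][a] = 1.0
    return [[v / sum(row) for v in row] for row in adj]

def local_stream(features, adj):
    """GCN-style aggregation: each joint averages itself and its neighbours."""
    n, dim = len(features), len(features[0])
    return [[sum(adj[i][j] * features[j][c] for j in range(n))
             for c in range(dim)] for i in range(n)]

def global_stream(features):
    """Self-attention without learned projections: each joint attends to all joints."""
    n, dim = len(features), len(features[0])
    out = []
    for query in features:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in features]
        top = max(scores)  # subtract the maximum for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(weights[j] / total * features[j][c] for j in range(n))
                    for c in range(dim)])
    return out

def adaptive_fusion(global_feats, local_feats, w_global, w_local):
    """Blend the two streams per joint with a two-way softmax gate.

    w_global and w_local are hypothetical gate vectors; in a trained
    model they would be learned parameters.
    """
    fused = []
    for g, l in zip(global_feats, local_feats):
        score_g = sum(a * b for a, b in zip(g, w_global))
        score_l = sum(a * b for a, b in zip(l, w_local))
        top = max(score_g, score_l)
        exp_g, exp_l = math.exp(score_g - top), math.exp(score_l - top)
        alpha = exp_g / (exp_g + exp_l)  # weight given to the global stream
        fused.append([alpha * a + (1.0 - alpha) * b for a, b in zip(g, l)])
    return fused

# Toy example: a 3-joint chain (0 - 1 - 2) with 2 feature channels per joint.
edges = [(0, 1), (1, 2)]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adj = normalize_adjacency(edges, num_joints=3)
local_out = local_stream(features, adj)
global_out = global_stream(features)
fused_out = adaptive_fusion(global_out, local_out, [0.0, 0.0], [0.0, 0.0])
```

With zero gate vectors, as in the toy call, the gate is 0.5 for every joint and the fused output is the plain average of the two streams. Non-zero gate vectors would let each joint favour whichever stream is more informative.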

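We propose a novel Attention-GCNFormer (AGFormer) block that splits the channels between two parallel streams, a Transformer stream and a GCNFormer stream, in order to capture the local dependencies between adjacent joints precisely. By fusing the two representations adaptively, the AGFormer block learns the underlying 3D structure better. By stacking multiple AGFormer blocks, we propose MotionAGFormer in four variants, which can be chosen according to the speed-accuracy trade-off. We evaluate our model on two widely used benchmark datasets, Human3.6M and MPI-INF-3DHP. MotionAGFormer-B achieves state-of-the-art results, with P1 errors of 38.4 mm and 16.2 mm, respectively. Notably, compared with the previous state-of-the-art model on the Human3.6M dataset, it uses a quarter of the parameters and is three times more computationally efficient. Code and models are available at https://github.com/TaatiTeam/MotionAGFormer.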
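
As a reading aid for the reported errors, the following is a minimal sketch of the metric, assuming that "P1 error" refers to the mean per-joint position error (MPJPE), as is conventional for Protocol 1 on Human3.6M. The function name and the toy coordinates are illustrative only and are not taken from the paper.

```python
import math

def mpjpe(predicted, target):
    """Mean Euclidean distance between matching joints, in the input units."""
    if not predicted or len(predicted) != len(target):
        raise ValueError("predicted and target must be non-empty and equally long")
    total = 0.0
    for pred_joint, true_joint in zip(predicted, target):
        total += math.dist(pred_joint, true_joint)  # per-joint 3D distance
    return total / len(predicted)

# Toy pose with two joints, coordinates in millimetres.
predicted_pose = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
target_pose = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
error_mm = mpjpe(predicted_pose, target_pose)  # distances 0 and 5, mean 2.5
```

In a benchmark evaluation this per-pose value would additionally be averaged over all frames and sequences of the test set.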

MotionAGFormer: Enhancing 3D Human Pose Estimation with a Transformer-GCNFormer Network