In recent years, 2D human pose estimation has made significant progress on public benchmarks. However, many of these approaches are difficult to deploy in industry because of their large parameter counts and computational overhead. Efficient human pose estimation remains a hurdle, especially for whole-body pose estimation with numerous keypoints. While most current methods for efficient human pose estimation rely primarily on CNNs, we propose the Group-based Token Pruning Transformer (GTPT), which fully harnesses the advantages of the Transformer. GTPT alleviates the computational burden by introducing keypoints gradually in a coarse-to-fine manner, which keeps computational overhead low while maintaining high performance. In addition, GTPT groups keypoint tokens and prunes visual tokens to improve model performance while reducing redundancy. We propose Multi-Head Group Attention (MHGA) between different groups to achieve global interaction with little computational overhead. We conducted experiments on COCO and COCO-WholeBody. Compared with other methods, GTPT achieves higher performance with less computation, especially for whole-body pose estimation with numerous keypoints.
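To make the idea of pruning visual tokens concrete, here is a minimal sketch in plain Python. It is not the GTPT implementation: the abstract does not state how tokens are scored, so the criterion used here (mean attention that each visual token receives from the keypoint tokens) and the names `prune_visual_tokens`, `attn`, and `keep_ratio` are assumptions made purely for illustration.

```python
from typing import List, Sequence


def prune_visual_tokens(
    attn: Sequence[Sequence[float]],
    keep_ratio: float,
) -> List[int]:
    """Return indices of the visual tokens to keep, in their original order.

    attn[k][n] is the attention weight from keypoint token k to visual token n.
    ASSUMPTION: scoring by mean attention over keypoint tokens is a
    hypothetical criterion; the abstract does not specify GTPT's rule.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    num_keypoints = len(attn)
    num_visual = len(attn[0])
    # Importance of each visual token: average attention it receives.
    scores = [
        sum(attn[k][n] for k in range(num_keypoints)) / num_keypoints
        for n in range(num_visual)
    ]
    # Always keep at least one token.
    num_keep = max(1, int(keep_ratio * num_visual))
    # Rank by score (descending), keep the top ones, then restore token order.
    ranked = sorted(range(num_visual), key=lambda n: scores[n], reverse=True)
    return sorted(ranked[:num_keep])


# Two keypoint tokens attending to four visual tokens (toy values).
attn = [
    [0.50, 0.05, 0.30, 0.15],
    [0.40, 0.10, 0.10, 0.40],
]
kept = prune_visual_tokens(attn, keep_ratio=0.5)
print(kept)  # [0, 3]
```

Dropping low-scoring visual tokens shortens the sequence that later attention layers process, which is where the computational saving would come from under this assumption.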

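The Group-based Token Pruning Transformer (GTPT) is an efficient human pose estimation method that introduces keypoints gradually in a coarse-to-fine manner, reducing the computational burden while maintaining high performance. It groups keypoint tokens and prunes visual tokens to improve model performance and reduce redundancy, and it uses Multi-Head Group Attention (MHGA) to achieve global interaction. Experimental results show that GTPT achieves higher performance with less computation for both body and whole-body pose estimation.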

GTPT: Group-based Token Pruning Transformer for Efficient Human Pose Estimation