Recently, the vision transformer and its variants have played an increasingly important role in both monocular and multi-view human pose estimation. Treating image patches as tokens, transformers can model global dependencies within an entire image or across images from other views. However, global attention is computationally expensive. As a consequence, it is difficult to scale these transformer-based methods to high-resolution features and many views. In this paper, we propose the token-Pruned Pose Transformer (PPT) for 2D human pose estimation, which can locate a rough human mask and perform self-attention only within the selected tokens. Furthermore, we extend our PPT to multi-view human pose estimation. Built upon PPT, we propose a new cross-view fusion strategy, called human area fusion, which considers all human foreground pixels as corresponding candidates. Experimental results on COCO and MPII demonstrate that our PPT can match the accuracy of previous pose transformer methods while reducing computation. Moreover, experiments on Human 3.6M and Ski-Pose demonstrate that our Multi-view PPT can efficiently fuse cues from multiple views and achieve new state-of-the-art results.
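
To make the idea of attending only within selected tokens concrete, the following is a minimal NumPy sketch, not the PPT implementation. It assumes each token already carries a foreground score (the `scores` array here is random and purely hypothetical; the abstract does not say how the rough human mask is computed), keeps a fixed fraction of the highest-scoring tokens, and runs a single-head attention without learned projections over that subset only. The function names and the `keep_ratio` parameter are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    # Single-head self-attention with no learned projections (illustration only).
    d = tokens.shape[-1]
    weights = softmax(tokens @ tokens.T / np.sqrt(d))
    return weights @ tokens

def pruned_self_attention(tokens, scores, keep_ratio=0.5):
    # Keep the top-scoring tokens (assumed to approximate the human area)
    # and attend only among them; the remaining tokens are dropped.
    n = tokens.shape[0]
    k = max(1, int(round(n * keep_ratio)))
    keep = np.sort(np.argsort(scores)[-k:])  # indices of kept tokens, in original order
    return self_attention(tokens[keep]), keep

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))  # 16 patch tokens with 8 features each
scores = rng.random(16)            # hypothetical per-token foreground scores
out, keep = pruned_self_attention(tokens, scores, keep_ratio=0.25)
```

With 16 tokens and `keep_ratio=0.25`, the attention matrix has 4 x 4 = 16 entries instead of 16 x 16 = 256, which illustrates why restricting attention to a subset of tokens reduces the quadratic cost of global attention.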

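This paper proposes the Token-Pruned Pose Transformer (PPT), a transformer-based method for 2D human pose estimation, together with its extension to multi-view pose estimation. PPT computes self-attention only within selected tokens and adopts a new cross-view fusion strategy, called human area fusion, to efficiently fuse cues from multiple views. It matches the accuracy of previous pose transformer methods while reducing computation, and achieves new state-of-the-art results on the Human 3.6M and Ski-Pose datasets.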

Token-Pruned Pose Transformer for Monocular and Multi-View Human Pose Estimation