We propose a new end-to-end trainable approach for multi-instance pose estimation by combining a convolutional neural network with a transformer. We cast multi-instance pose estimation from images as a direct set prediction problem. Inspired by recent work on end-to-end trainable object detection with transformers, we use a transformer encoder-decoder architecture together with a bipartite matching scheme to directly regress the pose of all individuals in a given image. Our model, called POse Estimation Transformer (POET), is trained using a novel set-based global loss that consists of a keypoint loss, a keypoint visibility loss, a center loss and a class loss. POET reasons about the relations between detected humans and the full image context to directly predict the poses in parallel. We show that POET can achieve high accuracy on the challenging COCO keypoint detection task. To the best of our knowledge, this model is the first end-to-end trainable multi-instance human pose estimation method.
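The bipartite matching step mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `pose_cost` and `match_poses` are hypothetical, the cost uses only a mean keypoint distance (the visibility, center and class terms are omitted), and an exhaustive search stands in for a proper assignment solver such as the Hungarian algorithm, so it is only practical for a handful of poses.

```python
from itertools import permutations
from math import hypot


def pose_cost(pred, gt):
    """Mean Euclidean distance over keypoints (assumed stand-in for a keypoint loss)."""
    return sum(hypot(px - gx, py - gy) for (px, py), (gx, gy) in zip(pred, gt)) / len(gt)


def match_poses(preds, gts):
    """Find the one-to-one assignment of predictions to ground-truth poses with minimal total cost.

    Returns (assignment, total_cost), where assignment[j] is the index of the
    prediction matched to ground-truth pose j. Exhaustive search over all
    assignments; assumes len(preds) >= len(gts).
    """
    best_assignment, best_cost = None, float("inf")
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(pose_cost(preds[i], gts[j]) for j, i in enumerate(perm))
        if cost < best_cost:
            best_assignment, best_cost = perm, cost
    return best_assignment, best_cost


# Toy data: three predicted poses and two annotated people, two keypoints each.
preds = [
    [(0.9, 0.9), (0.8, 0.7)],
    [(0.1, 0.1), (0.2, 0.3)],
    [(0.5, 0.5), (0.5, 0.6)],
]
gts = [
    [(0.1, 0.1), (0.2, 0.2)],
    [(0.9, 0.9), (0.8, 0.8)],
]

assignment, total = match_poses(preds, gts)
print(assignment)  # (1, 0): ground truth 0 pairs with prediction 1, ground truth 1 with prediction 0
```

In this toy case prediction 2 stays unmatched. In set-prediction training of this kind, unmatched predictions are typically supervised only through the class term, toward a "no object" label; whether POET handles them exactly this way is an assumption here.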

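This work proposes POET (POse Estimation Transformer), an end-to-end trainable method for multi-instance pose estimation that combines a convolutional neural network with a transformer encoder-decoder to predict the poses of all instances directly from an image. We train POET with a novel set-based global loss that includes a keypoint loss, a visibility loss and a class loss, and show that it achieves high accuracy and high speed on the COCO keypoint detection task. In addition, we demonstrate successful transfer learning when applying POET to animal pose estimation. It is the first end-to-end trainable multi-instance pose estimation method and a promising alternative.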

End-to-End Trainable Multi-Instance Pose Estimation with Transformers