Multi-person 3D human pose estimation from a single image is a challenging problem, especially in in-the-wild settings, owing to the lack of 3D annotated data. We propose HG-RCNN, a Mask-RCNN based network that also leverages the benefits of the Hourglass architecture for multi-person 3D human pose estimation. We present a two-stage approach that first estimates the 2D keypoints in every Region of Interest (RoI) and then lifts the estimated keypoints to 3D. Finally, the estimated 3D poses are placed in camera coordinates using a weak-perspective projection assumption and joint optimization of the focal length and root translations. The result is a simple and modular network for multi-person 3D human pose estimation that does not require any multi-person 3D pose dataset. Despite its simple formulation, HG-RCNN achieves state-of-the-art results on MuPoTS-3D while also approximating the 3D pose in the camera coordinate system.

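The proposed HG-RCNN network combines Mask-RCNN and the Hourglass architecture for multi-person 3D human pose estimation. It first predicts the 2D keypoints in each Region of Interest (RoI) and then lifts them to 3D. The estimated 3D poses are finally placed in the camera coordinate system using a weak-perspective projection model and joint optimization of the focal length and root translations. The network is simple and modular, requires no multi-person 3D pose dataset, achieves state-of-the-art performance on MuPoTS-3D, and can approximate the 3D pose in the camera coordinate system.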
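
To make the final placement step more concrete, the following is a minimal sketch of how a root translation could be recovered under a weak-perspective assumption. It is not the authors' implementation: the function name `weak_perspective_root_translation`, the least-squares formulation, and the choice to treat the focal length as known are all assumptions made for illustration, whereas the abstract describes a joint optimization over focal length and root translations. Under weak perspective, every joint of a person is assumed to share the root depth `tz`, so the projection `u = focal * (X + tx) / tz + cx` becomes linear in `(tx, ty, tz)`.

```python
import numpy as np


def weak_perspective_root_translation(pose_3d, kpts_2d, focal, center):
    """Least-squares root translation (tx, ty, tz) under weak perspective.

    Hypothetical helper, for illustration only.
    pose_3d : (J, 3) root-relative 3D joints (metric units).
    kpts_2d : (J, 2) 2D keypoints in pixels.
    focal   : focal length in pixels (assumed known here).
    center  : (2,) principal point in pixels.
    """
    pose_3d = np.asarray(pose_3d, dtype=float)
    # Keypoints relative to the principal point.
    uv = np.asarray(kpts_2d, dtype=float) - np.asarray(center, dtype=float)
    num_joints = pose_3d.shape[0]

    A = np.zeros((2 * num_joints, 3))
    b = np.zeros(2 * num_joints)
    # x rows: focal * tx - u * tz = -focal * X
    A[0::2, 0] = focal
    A[0::2, 2] = -uv[:, 0]
    b[0::2] = -focal * pose_3d[:, 0]
    # y rows: focal * ty - v * tz = -focal * Y
    A[1::2, 1] = focal
    A[1::2, 2] = -uv[:, 1]
    b[1::2] = -focal * pose_3d[:, 1]

    # Solve the overdetermined linear system in the least-squares sense.
    t, *_ = np.linalg.lstsq(A, b, rcond=None)
    return t


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pose = rng.normal(scale=0.3, size=(17, 3))  # synthetic root-relative pose
    t_true = np.array([0.3, -0.2, 4.0])
    focal, center = 1000.0, np.array([500.0, 500.0])
    # Synthetic weak-perspective projection: all joints share depth t_true[2].
    kpts = focal * (pose[:, :2] + t_true[:2]) / t_true[2] + center
    print(weak_perspective_root_translation(pose, kpts, focal, center))
```

Fixing the focal length keeps the problem linear. Letting it vary as well, as the abstract describes, would couple it with the depth of every person and call for a nonlinear solver, which this sketch does not attempt.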

Estimating multi-person 3D human pose from a monocular image