The 3D world constrains human body pose, and human body pose in turn conveys information about the surrounding objects. Indeed, given a single image of a person in an indoor scene, humans are adept at resolving ambiguities in both the human pose and the room layout, drawing on knowledge of physical laws and prior experience of plausible object and human poses. However, few computer vision models fully leverage this fact. In this work, we propose an end-to-end trainable model that perceives the 3D scene from a single RGB image, estimates the camera pose and the room layout, and reconstructs both human body and object meshes. By imposing a comprehensive set of losses on all aspects of the estimation, we show that our model outperforms existing human body mesh methods and indoor scene reconstruction methods. To the best of our knowledge, this is the first model that outputs both object and human predictions at the mesh level and performs joint optimization over the scene and human poses.

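This paper proposes an end-to-end trainable model that perceives the 3D scene from a single RGB image, estimates the camera pose and room layout, and reconstructs human body and object meshes. By imposing comprehensive and sophisticated losses on all aspects of the estimation, the authors show that the model outperforms existing human body mesh methods and indoor scene reconstruction methods. To the best of the authors' knowledge, this is the first model that outputs both object and human predictions at the mesh level and jointly optimizes the scene and human poses.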

Holistic 3D Human and Scene Mesh Estimation from a Single Image