This paper focuses on structured-output learning using deep neural networks for 3D human pose estimation from monocular images. Our network takes an image and 3D pose as inputs and outputs a score value, which is high when the image-pose pair matches and low otherwise. The network structure consists of a convolutional neural network for image feature extraction, followed by two sub-networks for transforming the image features and pose into a joint embedding. The score function is then the dot-product between the image and pose embeddings. The image-pose embedding and score function are jointly trained using a maximum-margin cost function. Our proposed framework can be interpreted as a special form of structured support vector machines where the joint feature space is discriminatively learned using deep neural networks. We test our framework on the Human3.6m dataset and obtain state-of-the-art results compared to other recent methods. Finally, we present visualizations of the image-pose embedding space, demonstrating the network has learned a high-level embedding of body-orientation and pose-configuration.
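The scoring and training idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the single linear-plus-ReLU layer per branch, the mean-absolute-difference pose error, the finite candidate set, and all names (`embed`, `score`, `pose_error`, `max_margin_loss`, `W_img`, `W_pose`) are assumptions made for illustration. It shows only the two ingredients the abstract names: a dot-product score between image and pose embeddings, and a maximum-margin (structured hinge) cost.

```python
# Hypothetical sketch: dot-product score plus a structured hinge loss.
# Each branch is reduced to one linear layer with ReLU (an assumption);
# the real sub-networks and CNN feature extractor are not reproduced here.

def embed(weights, x):
    """Map vector x into the joint space: linear layer followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]

def score(img_feat, pose, W_img, W_pose):
    """Score of an image-pose pair: dot product of the two embeddings."""
    e_img = embed(W_img, img_feat)
    e_pose = embed(W_pose, pose)
    return sum(a * b for a, b in zip(e_img, e_pose))

def pose_error(pose_a, pose_b):
    """Illustrative margin term: mean absolute coordinate difference."""
    return sum(abs(a - b) for a, b in zip(pose_a, pose_b)) / len(pose_a)

def max_margin_loss(img_feat, true_pose, candidates, W_img, W_pose):
    """Hinge on the most violating candidate (margin rescaling).

    A candidate violates the margin when its score plus its pose error
    exceeds the score of the ground-truth pose.
    """
    s_true = score(img_feat, true_pose, W_img, W_pose)
    worst = 0.0
    for cand in candidates:
        violation = (score(img_feat, cand, W_img, W_pose)
                     + pose_error(cand, true_pose) - s_true)
        worst = max(worst, violation)
    return worst
```

In this sketch the loss is zero once the ground-truth pose outscores every candidate by at least that candidate's pose error, which is the sense in which the score is "high when the image-pose pair matches and low otherwise."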

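This paper proposes a structured-output learning method based on deep neural networks for estimating 3D human pose from monocular images. The method takes an image and a 3D pose as inputs, extracts image features with a convolutional neural network, and uses two sub-networks to map the image features and the pose into a joint embedding; the dot product of the two embeddings gives the score. The joint embedding and the score function are trained together with a maximum-margin cost function, so the resulting network is a special form of structured support vector machine whose joint feature space is learned discriminatively by deep neural networks. The framework is tested on the Human3.6m dataset and achieves state-of-the-art results compared with other recent methods. Finally, we present visualizations of the image-pose embedding space, which show that the network has learned a high-level embedding of body orientation and pose configuration.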

Maximum-Margin Structured Learning with Deep Networks for 3D Human Pose Estimation