Monocular 3D object parsing is highly desirable in various scenarios, including occlusion reasoning and holistic scene interpretation. We present a deep convolutional neural network (CNN) architecture that, given a single RGB image, localizes semantic object parts in the 2D image and in 3D space while inferring their visibility states. Our key insight is to exploit domain knowledge to regularize the network by deeply supervising its hidden layers, so that it sequentially infers a causal sequence of intermediate concepts associated with the final task. To acquire training data in the desired quantities with ground-truth 3D shape and intermediate concepts, we render 3D object CAD models to generate large-scale synthetic data and simulate challenging occlusion configurations between objects. We train the network only on synthetic data and demonstrate state-of-the-art performance on real-image benchmarks, including an extended version of KITTI, PASCAL VOC, PASCAL3D+ and IKEA, for 2D and 3D keypoint localization and instance segmentation. The empirical results substantiate the utility of our deep supervision scheme by demonstrating effective transfer of knowledge from synthetic data to real images, resulting in less overfitting compared to standard end-to-end training.
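
To make the idea of deeply supervising hidden layers more concrete, the following is a minimal sketch in plain Python of how a training loss might combine one term per intermediate concept. It is an illustration under stated assumptions, not the network or loss used in this work: the concept names (`visibility`, `keypoints_3d`, `keypoints_2d`), their ordering, the squared-error loss, the weights, and the helper `deep_supervision_loss` are all hypothetical.

```python
# Minimal sketch of a deeply supervised loss (pure Python, no framework).
# Assumption: each concept is predicted from a different hidden layer,
# ordered from an early concept to the final task. Names are illustrative.

def squared_error(pred, target):
    """Mean squared error between two equal-length lists of floats."""
    if len(pred) != len(target):
        raise ValueError("prediction and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def deep_supervision_loss(predictions, targets, weights):
    """Return (weighted total loss, per-concept losses).

    predictions, targets: dicts mapping concept name -> list of floats.
    weights: dict mapping concept name -> non-negative loss weight.
    """
    per_concept = {}
    total = 0.0
    for name, weight in weights.items():
        loss = squared_error(predictions[name], targets[name])
        per_concept[name] = loss
        total += weight * loss
    return total, per_concept


# Hypothetical outputs read from three supervised layers.
predictions = {
    "visibility": [1.0, 0.0],
    "keypoints_3d": [0.0, 1.0, 2.0],
    "keypoints_2d": [0.25, 0.75],
}
# Hypothetical ground truth, which synthetic rendering could supply.
targets = {
    "visibility": [1.0, 1.0],
    "keypoints_3d": [0.0, 1.0, 4.0],
    "keypoints_2d": [0.25, 0.75],
}
# Hypothetical per-concept weights.
weights = {"visibility": 1.0, "keypoints_3d": 1.0, "keypoints_2d": 1.0}

total, per_concept = deep_supervision_loss(predictions, targets, weights)
print(per_concept, total)
```

In this sketch every supervised layer contributes its own error term, so a mistake in an early concept is penalized directly rather than only through the final output, which is the contrast with standard end-to-end training drawn above.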

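In this work, we use a deep convolutional neural network architecture to localize semantic parts in the 2D image and in 3D space and to infer their visibility states. The network is trained on synthetic data with simulated occlusions, and it achieves state-of-the-art performance and effective knowledge transfer on real-image benchmarks.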

Deep Supervision with Shape Concepts for Occlusion-Aware 3D Object Parsing