Objects are three-dimensional entities, but visual observations are largely 2D. Inferring 3D properties from individual 2D views is thus a generically useful skill that is critical to object perception. We ask the question: can we learn useful image representations by explicitly training a system to infer 3D shape from 2D views? The few prior attempts at single-view 3D reconstruction all target the reconstruction task as an end in itself, and largely build category-specific models to get better reconstructions. In contrast, we are interested in this task as a means to learn generic visual representations that embed knowledge of 3D shape properties from arbitrary object views. We train a single category-agnostic neural network from scratch to produce a complete image-based shape representation from one view of a generic object in a single forward pass. Through comparison against several baselines on widely used shape datasets, we show that our system learns to infer shape for generic objects, including those from categories that are not present in the training set. To perform this "mental rotation" task, our system is forced to learn intermediate image representations that embed object geometry, without requiring any manual supervision. We show that these learned representations outperform other unsupervised representations on various semantic tasks, such as object recognition and object retrieval.

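This paper introduces an unsupervised learning approach that embeds 3D shape information into single-view image representations. Using a self-supervised training objective that takes a single 2D image as input and requires no manual semantic labels, it encourages the representation to capture basic shape primitives and semantic regularities. The resulting representation supports both object recognition and "mental rotation", and it outperforms comparable unsupervised learning methods.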

ShapeCodes: Self-Supervised Feature Learning by Lifting Views to Viewgrids