The goal of this paper is to compare surface-based and volumetric 3D object shape representations, as well as viewer-centered and object-centered reference frames for single-view 3D shape prediction. We propose a new algorithm for predicting depth maps from multiple viewpoints, with a single depth or RGB image as input. By modifying the network and the way models are evaluated, we can directly compare the merits of voxels vs. surfaces and viewer-centered vs. object-centered for familiar vs. unfamiliar objects, as predicted from RGB or depth images. Among our findings, we show that surface-based methods outperform voxel representations for objects from novel classes and produce higher resolution outputs. We also find that using viewer-centered coordinates is advantageous for novel objects, while object-centered representations are better for more familiar objects. Interestingly, the coordinate frame significantly affects the shape representation learned, with object-centered placing more importance on implicitly recognizing the object category and viewer-centered producing shape representations with less dependence on category recognition.
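
The distinction between the two reference frames can be made concrete with a small sketch. This is illustrative only and is not the paper's implementation: the helper names (`rotation_about_y`, `to_viewer_frame`, `to_object_frame`) and the restriction to a single azimuth rotation are assumptions made here for brevity. It shows that an object-centered target is the same for every camera pose, whereas a viewer-centered target changes with the pose of the input view.

```python
import math

def rotation_about_y(azimuth_deg):
    """Return a 3x3 rotation matrix (nested lists) about the vertical y-axis."""
    a = math.radians(azimuth_deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, 0.0, s],
            [0.0, 1.0, 0.0],
            [-s, 0.0, c]]

def transpose(R):
    """Transpose of a 3x3 matrix; for a rotation this is also its inverse."""
    return [[R[j][i] for j in range(3)] for i in range(3)]

def apply(R, p):
    """Multiply the 3x3 matrix R by the 3-vector p."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) for i in range(3))

def to_viewer_frame(points_obj, azimuth_deg):
    """Express canonical (object-centered) points in the camera's frame."""
    R = rotation_about_y(azimuth_deg)
    return [apply(R, p) for p in points_obj]

def to_object_frame(points_view, azimuth_deg):
    """Undo the camera rotation to recover canonical coordinates."""
    R_inv = transpose(rotation_about_y(azimuth_deg))
    return [apply(R_inv, p) for p in points_view]

# Hypothetical shape: two surface points given in the object's canonical pose.
canonical = [(0.0, 0.0, 1.0), (0.5, 0.0, 0.0)]

for azimuth in (0, 90):
    viewer_target = to_viewer_frame(canonical, azimuth)
    # The object-centered target is `canonical` for every azimuth;
    # the viewer-centered target depends on the input view.
    print(azimuth, [tuple(round(x, 3) for x in p) for p in viewer_target])
```

Under these assumptions, recovering the object-centered target from a viewer-centered one requires knowing the camera azimuth. This gives one intuition for why an object-centered predictor may lean on recognizing the category, and with it the canonical pose.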

Pixels, Voxels, and Views: A Study of Shape Representations for Single View 3D Object Shape Prediction