We present an approach that learns to synthesize high-quality, novel views of 3D objects or scenes, while providing fine-grained and precise control over the 6-DOF viewpoint. The approach is self-supervised and only requires 2D images and associated view transforms for training. Our main contribution is a network architecture that leverages a transforming auto-encoder in combination with a depth-guided warping procedure to predict geometrically accurate unseen views. Leveraging geometric constraints renders direct supervision via depth or flow maps unnecessary. If large parts of the object are occluded in the source view, a purely learning based prior is used to predict the values for dis-occluded pixels. Our network furthermore predicts a per-pixel mask, used to fuse depth-guided and pixel-based predictions. The resulting images reflect the desired 6-DOF transformation and details are preserved. We thoroughly evaluate our architecture on synthetic and real scenes and under fine-grained and fixed-view settings. Finally, we demonstrate that the approach generalizes to entirely unseen images such as product images downloaded from the internet.

本文提出了一个自监督学习的方法，通过深度引导的调整过程，利用变换自编码器的网络结构，在只有2D图像和相关视角变换的情况下精确合成高质量的3D对象或场景的新视角，并实现了细粒度和精密的六自由度视角控制。通过在合成和真实场景以及精细和固定视角设置下的彻底评估，证明了该方法的广泛适用性。

基于单目神经网络的连续视角控制图像渲染