Despite significant progress in monocular depth estimation in the wild, recent state-of-the-art methods cannot be used to recover accurate 3D scene shape. Two factors cause this: an unknown depth shift induced by the shift-invariant reconstruction losses used in mixed-data depth prediction training, and a possibly unknown camera focal length. We investigate this problem in detail and propose a two-stage framework that first predicts depth up to an unknown scale and shift from a single monocular image, and then uses 3D point cloud encoders to predict the missing depth shift and focal length, which allows us to recover a realistic 3D scene shape. In addition, we propose an image-level normalized regression loss and a normal-based geometry loss to enhance depth prediction models trained on mixed datasets. We test our depth model on nine unseen datasets and achieve state-of-the-art performance on zero-shot dataset generalization. Code is available at: https://git.io/Depth
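
To illustrate why an unknown depth shift prevents accurate shape recovery, the following minimal sketch back-projects one image row with a standard pinhole camera model. It is not the paper's code: the function names, the synthetic slanted wall, the focal length of 4 pixels, and the shift of 1.0 are all assumptions chosen for illustration, and the principal point is assumed to sit at the row center.

```python
def unproject_row(depths, focal, shift=0.0):
    """Back-project one image row to (x, z) points with a pinhole camera.

    Assumption: the principal point is at the row center.
    """
    cx = (len(depths) - 1) / 2.0
    points = []
    for u, d in enumerate(depths):
        z = d + shift  # a wrong shift changes every depth by the same constant
        points.append(((u - cx) * z / focal, z))
    return points


def collinearity_residual(p0, p1, p2):
    """Twice the signed triangle area; zero when the three points are collinear."""
    return (p1[0] - p0[0]) * (p2[1] - p0[1]) - (p2[0] - p0[0]) * (p1[1] - p0[1])


def slanted_wall_depths(width, focal, offset=2.0, slope=0.5):
    """Depths a pinhole camera observes for the flat wall z = offset + slope * x."""
    cx = (width - 1) / 2.0
    return [offset / (1.0 - slope * (u - cx) / focal) for u in range(width)]


focal = 4.0  # hypothetical focal length in pixels
depths = slanted_wall_depths(5, focal)

good = unproject_row(depths, focal)            # correct shift: wall stays flat
bad = unproject_row(depths, focal, shift=1.0)  # depth off by a constant: wall bends

print(collinearity_residual(good[0], good[2], good[4]))  # close to zero
print(collinearity_residual(bad[0], bad[2], bad[4]))     # clearly non-zero
```

With the correct shift the three sampled points lie on a line, as a flat wall should. Adding a constant to every depth bends the reconstructed wall, and no global rescaling afterwards can undo that distortion. A wrong focal length, by contrast, would stretch the scene laterally while keeping planes planar, which is why both quantities must be estimated to obtain a realistic shape.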

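This work examines two obstacles to 3D shape recovery: the unknown depth shift caused by the shift-invariant reconstruction losses used in mixed-data depth prediction training, and the possibly unknown camera focal length. It designs a two-stage framework that first predicts depth from a single monocular image and then uses 3D point cloud encoders to predict the missing depth shift and focal length, recovering a realistic 3D scene shape. The paper also proposes an image-level normalized regression loss and a normal-based geometry loss to improve depth prediction models trained on mixed datasets. The depth model is tested on 9 unseen datasets and achieves state-of-the-art zero-shot cross-dataset generalization.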

Learning to Recover 3D Scene Shape from a Single Image