In this paper, we propose a novel video depth estimation approach, FutureDepth, which enables the model to implicitly leverage multi-frame and motion cues to improve depth estimation by learning to predict the future during training. More specifically, we propose a future prediction network, F-Net, which takes the features of multiple consecutive frames and is trained to iteratively predict multi-frame features one time step ahead. In this way, F-Net learns the underlying motion and correspondence information, and we incorporate its features into the depth decoding process. Additionally, to enrich the learning of multi-frame correspondence cues, we further leverage a reconstruction network, R-Net, which is trained via adaptively masked auto-encoding of multi-frame feature volumes. At inference time, both F-Net and R-Net are used to produce queries that work with the depth decoder, as well as with a final refinement network. Through extensive experiments on several benchmarks, i.e., NYUDv2, KITTI, DDAD, and Sintel, which cover indoor, driving, and open-domain scenarios, we show that FutureDepth significantly improves upon baseline models, outperforms existing video depth estimation methods, and sets new state-of-the-art (SOTA) accuracy. Furthermore, FutureDepth is more efficient than existing SOTA video depth estimation models and has latency similar to that of monocular models.
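To make the "predict one time step ahead, iteratively" idea concrete, the following is a minimal toy sketch and not the paper's implementation. It assumes each frame's features are a short list of floats, replaces the learned F-Net with a hypothetical `linear_extrapolator`, and uses a mean squared error objective. The names `linear_extrapolator`, `rollout_loss`, `window_size`, and `steps` are all illustrative assumptions and do not come from the paper.

```python
from typing import Callable, List

Frame = List[float]   # toy stand-in for one frame's feature map
Window = List[Frame]  # features of several consecutive frames


def linear_extrapolator(window: Window) -> Window:
    """Hypothetical stand-in for a learned predictor (NOT the paper's F-Net).

    Shifts the window one time step ahead: drops the oldest frame and
    appends a new frame extrapolated linearly from the last two frames.
    """
    prev, last = window[-2], window[-1]
    next_frame = [2 * a - b for a, b in zip(last, prev)]
    return window[1:] + [next_frame]


def rollout_loss(
    features: List[Frame],
    window_size: int,
    steps: int,
    predictor: Callable[[Window], Window],
) -> float:
    """Mean squared error of an iterative one-step-ahead rollout.

    Starting from the first `window_size` frames, the predictor is applied
    `steps` times, each time feeding its own output back in. After step s
    the predicted window is compared with the true frames s .. s+window_size-1.
    The MSE objective is an assumption made for this sketch.
    """
    if len(features) < window_size + steps:
        raise ValueError("need at least window_size + steps frames")
    window = features[:window_size]
    total, count = 0.0, 0
    for s in range(1, steps + 1):
        window = predictor(window)              # predict one step ahead
        target = features[s:s + window_size]    # true features, shifted by s
        for pred_frame, true_frame in zip(window, target):
            for p, t in zip(pred_frame, true_frame):
                total += (p - t) ** 2
                count += 1
    return total / count


if __name__ == "__main__":
    # Features that change linearly over time are extrapolated exactly.
    linear_motion = [[float(t), 2.0 * t] for t in range(6)]
    print(rollout_loss(linear_motion, window_size=3, steps=2,
                       predictor=linear_extrapolator))
```

In an actual training setup the predictor would be a trainable network and this loss would be minimized with respect to its parameters. The sketch only illustrates how a rollout feeds predictions back in and scores them against future frames.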

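This work proposes a novel video depth estimation method, FutureDepth, which lets the model implicitly exploit multi-frame and motion cues to improve depth estimation by learning to predict the future during training. Multi-frame features are fed into a future prediction network, F-Net, which iteratively predicts multi-frame features. The model thereby learns the underlying motion and correspondence information, and F-Net's features are incorporated into the depth decoding process. To enrich the learning of multi-frame correspondence cues, a reconstruction network, R-Net, is also trained via adaptively masked auto-encoding of multi-frame feature volumes. Extensive experiments on multiple benchmarks covering indoor, driving, and open-domain scenarios show that FutureDepth significantly improves on baseline models in accuracy, outperforms existing video depth estimation methods, and sets a new state of the art in accuracy. In addition, FutureDepth is more efficient than existing state-of-the-art video depth estimation models and has latency similar to that of monocular models.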

FutureDepth: Learning to Predict the Future Improves Video Depth Estimation