For the task of simultaneous monocular depth and visual odometry estimation, we propose learning self-supervised transformer-based models in two steps. The first step is generic pretraining to learn 3D geometry using the cross-view completion objective (CroCo); the second is self-supervised finetuning on non-annotated videos. We show that our self-supervised models can reach state-of-the-art performance 'without bells and whistles', using standard components such as visual transformers, dense prediction transformers and adapters. We demonstrate the effectiveness of the proposed method by evaluating it on six benchmark datasets covering static and dynamic scenes, indoor and outdoor environments, and synthetic and real images. On all datasets, our method outperforms state-of-the-art methods, in particular on the depth prediction task.

Self-Supervised Pretraining and Finetuning for Monocular Depth and Visual Odometry