Monocular depth estimation in colonoscopy video aims to overcome the unusual lighting properties of the colonoscopic environment. One of the major challenges in this area is the domain gap between annotated but unrealistic synthetic data and unannotated but realistic clinical data. Previous attempts to bridge this domain gap directly target the depth estimation task itself. We propose a general pipeline for structure-preserving synthetic-to-real (sim2real) image translation, which produces a modified version of the input image while retaining its depth geometry through the translation process. This allows us to generate large quantities of realistic-looking synthetic images for supervised depth estimation with improved generalization to the clinical domain. We also introduce a dataset of hand-picked sequences from clinical colonoscopies to improve the image translation process. We demonstrate that the translated images are simultaneously realistic and faithful to their depth maps via the performance of downstream depth estimation on various datasets.

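This work addresses the domain gap between synthetic data and real clinical data in monocular depth estimation for colonoscopy video. It proposes a general pipeline for structure-preserving synthetic-to-real image translation that generates large quantities of realistic synthetic images, improving the generalization of depth estimation. The results show that the translated images are highly realistic while preserving depth geometry, which helps improve performance on downstream depth estimation tasks.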

Structure-Preserving Image Translation for Depth Estimation in Colonoscopy Video