Monocular depth estimation remains an open challenge in computer vision. Recent Transformer models have shown notable advantages over conventional CNNs on this task. However, it is still poorly understood how these models prioritize different regions of a 2D image and how those regions affect depth estimation performance. To explore the differences between Transformers and CNNs, we use a sparse-pixel approach to compare the two. Our findings suggest that while Transformers excel at handling global context and intricate textures, they lag behind CNNs in preserving depth gradient continuity. To improve Transformer performance in monocular depth estimation, we propose the Depth Gradient Refinement (DGR) module, which refines depth estimates through high-order differentiation, feature fusion, and recalibration. In addition, we draw on optimal transport theory: treating depth maps as spatial probability distributions, we use the optimal transport distance as a loss function to optimize our model. Experimental results demonstrate that models equipped with the plug-and-play DGR module and the proposed loss function achieve better performance without added complexity or computational cost. This research not only offers fresh insights into how Transformers and CNNs differ in depth estimation but also paves the way for novel depth estimation methods.
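To make the notion of depth gradient continuity concrete, the following minimal sketch computes finite-difference depth gradients of a chosen order and a simple discontinuity score. It is an illustration only and not the DGR module itself: the function names `depth_gradients` and `gradient_discontinuity`, the use of plain finite differences, and the toy depth maps are all assumptions introduced here.

```python
import numpy as np


def depth_gradients(depth, order=1):
    """Finite-difference gradients of a 2D depth map along x and y.

    `order` is the number of times the difference is applied
    (order=2 gives second-order differences). Hypothetical helper.
    """
    depth = np.asarray(depth, dtype=np.float64)
    gx = np.diff(depth, n=order, axis=1)  # horizontal direction
    gy = np.diff(depth, n=order, axis=0)  # vertical direction
    return gx, gy


def gradient_discontinuity(depth):
    """Mean absolute second-order difference.

    This is zero for a planar depth ramp and grows where the
    depth gradient changes abruptly.
    """
    gxx, gyy = depth_gradients(depth, order=2)
    return float(np.abs(gxx).mean() + np.abs(gyy).mean())


# Toy data (assumed): a planar ramp and the same ramp with a depth jump.
ramp = np.add.outer(np.arange(4.0), 2.0 * np.arange(5.0))
step = ramp.copy()
step[:, 3:] += 10.0

print(gradient_discontinuity(ramp))  # planar surface, constant gradient
print(gradient_discontinuity(step))  # larger, because of the jump
```

A score of this kind could serve as a diagnostic when comparing predictions from different backbones; how the actual module uses high-order derivatives is not specified in the abstract.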

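By comparing how Transformer models and CNNs handle different regions of a 2D image and how those regions affect depth estimation performance, we find that Transformers excel at handling global context and complex textures but lag behind CNNs in preserving depth gradient continuity. To further improve Transformer performance in monocular depth estimation, we propose the Depth Gradient Refinement (DGR) module, which improves depth estimation through high-order differentiation, feature fusion, and recalibration. In addition, we use optimal transport theory to treat depth maps as spatial probability distributions and adopt the optimal transport distance as the loss function for optimizing the model. Experimental results show that models integrated with the DGR module and the proposed loss function improve performance without increasing complexity or computational cost. This study not only provides new insights into the differences between Transformers and CNNs in depth estimation but also paves the way for new depth estimation methods.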
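The idea of treating a depth map as a spatial probability distribution and penalizing a transport distance can be sketched as follows. The abstract does not state which optimal transport distance or solver is used, so this example substitutes a simple surrogate: the exact 1D Wasserstein-1 distance applied to the row and column marginals of the normalized maps. All function names here are hypothetical.

```python
import numpy as np


def to_distribution(depth):
    """Normalize a non-negative depth map so that its pixels sum to 1."""
    d = np.asarray(depth, dtype=np.float64)
    total = d.sum()
    if np.any(d < 0) or total <= 0:
        raise ValueError("depth must be non-negative with a positive sum")
    return d / total


def wasserstein_1d(p, q):
    """Exact W1 distance between two histograms on the same unit-spaced grid.

    In 1D, W1 equals the summed absolute difference of the two CDFs.
    """
    return float(np.abs(np.cumsum(p - q)).sum())


def marginal_ot_loss(pred, target):
    """Surrogate loss (assumption): W1 on column marginals plus W1 on row marginals."""
    p, q = to_distribution(pred), to_distribution(target)
    loss_x = wasserstein_1d(p.sum(axis=0), q.sum(axis=0))
    loss_y = wasserstein_1d(p.sum(axis=1), q.sum(axis=1))
    return loss_x + loss_y


# Toy maps (assumed): all mass in one pixel, shifted between the two maps.
a = np.zeros((3, 4))
a[0, 0] = 1.0
b = np.zeros((3, 4))
b[2, 3] = 1.0

print(marginal_ot_loss(a, a))  # identical maps
print(marginal_ot_loss(a, b))  # mass moved 3 columns and 2 rows
```

Unlike a per-pixel L1 or L2 loss, a transport-based distance grows with how far mass has to move, which is one plausible reason it would favor spatially coherent depth predictions. A full 2D formulation would typically rely on an entropic solver such as Sinkhorn iterations rather than this marginal surrogate.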

Improving Depth Gradient Continuity in Transformers: A Comparative Study on Monocular Depth Estimation with CNNs