This paper presents StereoNet, the first end-to-end deep architecture for real-time stereo matching that runs at 60 fps on an NVidia Titan X, producing high-quality, edge-preserved, quantization-free disparity maps. A key insight of this paper is that the network achieves a sub-pixel matching precision that is an order of magnitude higher than that of traditional stereo matching approaches. This allows us to achieve real-time performance by using a very low resolution cost volume that encodes all the information needed to achieve high disparity precision. Spatial precision is achieved by employing a learned edge-aware upsampling function. Our model uses a Siamese network to extract features from the left and right images. A first estimate of the disparity is computed in a very low resolution cost volume; the model then hierarchically re-introduces high-frequency details through a learned upsampling function that uses compact pixel-to-pixel refinement networks. Leveraging the color input as a guide, this function is capable of producing high-quality edge-aware output. We achieve compelling results on multiple benchmarks, showing how the proposed method offers extreme flexibility at an acceptable computational budget.

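The paper proposes StereoNet, the first end-to-end deep learning architecture for real-time stereo matching. It runs at 60 fps on an NVidia Titan X and produces high-quality, edge-preserving, quantization-free disparity maps. The key insight is that the network reaches a sub-pixel matching precision an order of magnitude higher than that of traditional stereo matching methods, so a low-resolution cost volume can encode all the required information, which is what makes real-time operation possible. Spatial precision comes from a learned edge-aware upsampling function, and a Siamese network extracts features from the left and right images. An initial disparity estimate is computed in a very low resolution cost volume; the model then hierarchically re-introduces high-frequency details through a learned upsampling function built from compact pixel-to-pixel refinement networks. Using the color input as a guide, this function produces high-quality edge-aware output. The method achieves compelling results on multiple benchmarks, demonstrating that it offers great flexibility at an acceptable computational budget.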
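To make the idea of reading a sub-pixel disparity out of a coarse cost volume more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not say how the matching cost is computed or how the disparity is regressed, so the sketch assumes a mean absolute feature difference as the cost and a soft argmin (the softmax-weighted average of the disparity candidates) as the readout. The names `cost_volume`, `soft_argmin`, and `sharpness` are hypothetical, and random arrays stand in for the features a Siamese network would produce.

```python
import numpy as np

def cost_volume(feat_left, feat_right, max_disp):
    """Build a (max_disp, H, W) cost volume from (C, H, W) feature maps.

    Assumption: the cost is the mean absolute feature difference.
    """
    _, h, w = feat_left.shape
    # Large cost where the matching right pixel would fall outside the image.
    volume = np.full((max_disp, h, w), 1e3, dtype=np.float64)
    for d in range(max_disp):
        # Left pixel x is compared with right pixel x - d.
        diff = np.abs(feat_left[:, :, d:] - feat_right[:, :, : w - d])
        volume[d, :, d:] = diff.mean(axis=0)
    return volume

def soft_argmin(volume, sharpness=1.0):
    """Expected disparity under softmax(-sharpness * cost).

    The result is a weighted average of integer candidates, so it can
    take fractional (sub-pixel) values. `sharpness` is a hypothetical knob.
    """
    logits = -sharpness * volume
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    prob = np.exp(logits)
    prob /= prob.sum(axis=0, keepdims=True)
    candidates = np.arange(volume.shape[0]).reshape(-1, 1, 1)
    return (prob * candidates).sum(axis=0)

# Stand-in "features": the left map is the right map shifted by 3 pixels.
rng = np.random.default_rng(0)
right = rng.random((32, 4, 16))
left = rng.random((32, 4, 16))
left[:, :, 3:] = right[:, :, :-3]

volume = cost_volume(left, right, max_disp=8)
disparity = soft_argmin(volume, sharpness=100.0)
```

Because the output is an expectation over candidates rather than a hard argmin, a pixel whose cost is similarly low at two neighboring disparities receives a value between them. This is one way a coarse volume can still carry fractional disparity information.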

StereoNet: Guided Hierarchical Refinement for Real-Time Edge-Aware Depth Prediction