We address the problem of estimating depth with multimodal audio-visual
data. Inspired by the ability of animals, such as bats and dolphins, to infer
the distance of objects with echolocation, some recent methods have utilized echoes
for depth estimation. We propose an end-to-end deep-learning-based pipeline
utilizing RGB images, binaural echoes and estimated material properties of
various objects within a scene. We argue that the relation between image,
echoes and depth, for different scene elements, is greatly influenced by the
properties of those elements, and a method designed to leverage this
information can lead to significantly improved depth estimation from
audio-visual inputs. We propose a novel multimodal fusion technique, which
incorporates the material properties explicitly while combining audio (echoes)
and visual modalities to predict the scene depth. We show empirically, with
experiments on the Replica dataset, that the proposed method obtains a 28%
improvement in RMSE compared to the state-of-the-art audio-visual depth
prediction method. To demonstrate the effectiveness of our method on a larger
dataset, we report competitive performance on Matterport3D, proposing to use it
as a multimodal depth prediction benchmark with echoes for the first time. We
also analyse the proposed method with exhaustive ablation experiments and
qualitative results. The code and models are available at
this https URL
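
To make the idea of material-aware fusion concrete, the following is a minimal illustrative sketch, not the architecture proposed in the paper. It assumes that each spatial location has a visual feature vector, an audio (echo) feature vector, and a material-property vector, and that the material vector decides how much weight each modality receives. The function name `material_gated_fusion` and the weight vectors `w_visual` and `w_audio` are hypothetical and stand in for learned parameters.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def material_gated_fusion(visual, audio, material, w_visual, w_audio):
    """Fuse visual and audio features with material-conditioned weights.

    visual, audio: lists of feature vectors, one per spatial location.
    material: list of material-property vectors, one per spatial location.
    w_visual, w_audio: hypothetical weight vectors (assumed learned) that
        score each modality from the material properties.
    """
    fused = []
    for v, a, m in zip(visual, audio, material):
        # Score each modality from the material properties at this location.
        score_v = sum(wi * mi for wi, mi in zip(w_visual, m))
        score_a = sum(wi * mi for wi, mi in zip(w_audio, m))
        # Turn the two scores into weights that sum to one.
        alpha_v, alpha_a = softmax([score_v, score_a])
        # Weighted combination of the two modalities.
        fused.append([alpha_v * vi + alpha_a * ai for vi, ai in zip(v, a)])
    return fused
```

In a full pipeline, the fused features would be passed to a decoder that predicts the depth map. Here the weighting is kept to a single softmax over two scores so that the role of the material properties is easy to follow.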

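Proposes an end-to-end deep-learning-based multimodal fusion technique that improves scene depth estimation from audio-visual inputs by using RGB images, binaural echoes, and the material properties of different objects in the scene. Experiments show that the method improves RMSE by 28% over the state-of-the-art audio-visual depth prediction method on the Replica dataset and achieves competitive performance on Matterport3D.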
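
For readers unfamiliar with the metric, the sketch below shows how RMSE and a relative improvement figure such as the reported 28% are typically computed. It assumes the improvement is the relative reduction in RMSE with respect to the baseline, which the text does not state explicitly. The function names are illustrative, and depth maps are represented as flat lists of per-pixel values.

```python
import math


def rmse(pred, target):
    """Root mean squared error between predicted and ground-truth depths."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equal in length")
    squared_error = sum((p - t) ** 2 for p, t in zip(pred, target))
    return math.sqrt(squared_error / len(pred))


def relative_improvement(baseline_error, new_error):
    """Fractional reduction in error relative to the baseline (lower is better)."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error
```

As a usage example with made-up numbers that are not taken from the paper, a baseline RMSE of 1.0 and a new RMSE of 0.72 give `relative_improvement(1.0, 0.72)` of about 0.28, that is, a 28% improvement.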