Semantic labeling of RGB-D scenes is crucial to many intelligent applications, including perceptual robotics. It generates pixelwise, fine-grained label maps from simultaneously sensed photometric (RGB) and depth channels. This paper addresses the problem by i) developing a novel Long Short-Term Memorized Context Fusion (LSTM-CF) model that captures and fuses contextual information from multiple channels of photometric and depth data, and ii) incorporating this model into deep convolutional neural networks (CNNs) for end-to-end training. Specifically, global contexts in the photometric and depth channels are each captured by stacking several convolutional layers and a long short-term memory layer; the memory layer encodes both short-range and long-range spatial dependencies in an image along the vertical direction. Another long short-term memorized fusion layer integrates the vertical contexts from the different channels and propagates the fused vertical contexts bi-directionally along the horizontal direction to obtain true 2D global contexts. Finally, the fused contextual representation is concatenated with the convolutional features extracted from the photometric channels in order to improve the accuracy of fine-scale semantic labeling. Our proposed model sets a new state of the art, i.e., 48.1% average class accuracy over 37 categories (11.8% improvement), on the large-scale SUNRGBD dataset.
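The data flow described above (vertical context per modality, fusion, bi-directional horizontal propagation, and a final concatenation with the photometric features) can be illustrated with a minimal pure-Python sketch. Everything in it is an assumption made for illustration, not the paper's implementation: the `TinyLSTM` class, the helper names, the toy sizes, the random untrained weights, the single top-to-bottom vertical scan, and the use of random tensors in place of real convolutional features.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTM:
    """Hypothetical single-layer LSTM with random, untrained weights."""

    def __init__(self, in_dim, hid_dim, rng):
        self.hid_dim = hid_dim
        # 4 gates (input, forget, output, candidate), each hid_dim rows;
        # every row covers [input, previous hidden state, bias].
        self.W = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim + hid_dim + 1)]
                  for _ in range(4 * hid_dim)]

    def run(self, seq):
        """Scan a sequence of vectors and return the hidden state at each step."""
        n = self.hid_dim
        h, c = [0.0] * n, [0.0] * n
        outputs = []
        for x in seq:
            z = list(x) + h + [1.0]
            pre = [sum(w * v for w, v in zip(row, z)) for row in self.W]
            i = [sigmoid(p) for p in pre[0:n]]
            f = [sigmoid(p) for p in pre[n:2 * n]]
            o = [sigmoid(p) for p in pre[2 * n:3 * n]]
            g = [math.tanh(p) for p in pre[3 * n:4 * n]]
            c = [fi * ci + ii * gi for fi, ci, ii, gi in zip(f, c, i, g)]
            h = [oi * math.tanh(ci) for oi, ci in zip(o, c)]
            outputs.append(h)
        return outputs


def vertical_context(fmap, lstm):
    """Run the LSTM down every column of an H x W grid of feature vectors."""
    H, W = len(fmap), len(fmap[0])
    out = [[None] * W for _ in range(H)]
    for x in range(W):
        column = [fmap[y][x] for y in range(H)]
        for y, h in enumerate(lstm.run(column)):
            out[y][x] = h
    return out


def fuse_horizontal(rgb_ctx, depth_ctx, fwd, bwd):
    """Concatenate both vertical contexts, then scan each row in both directions."""
    H, W = len(rgb_ctx), len(rgb_ctx[0])
    fused = []
    for y in range(H):
        row = [rgb_ctx[y][x] + depth_ctx[y][x] for x in range(W)]
        left_to_right = fwd.run(row)
        right_to_left = bwd.run(row[::-1])[::-1]
        fused.append([a + b for a, b in zip(left_to_right, right_to_left)])
    return fused


# Toy sizes (assumed): 4 x 5 feature map, 3 feature channels, 2 hidden units.
rng = random.Random(0)
H, W, C, HID = 4, 5, 3, 2
rgb_feat = [[[rng.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
depth_feat = [[[rng.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]

rgb_ctx = vertical_context(rgb_feat, TinyLSTM(C, HID, rng))
depth_ctx = vertical_context(depth_feat, TinyLSTM(C, HID, rng))
fused = fuse_horizontal(rgb_ctx, depth_ctx,
                        TinyLSTM(2 * HID, HID, rng), TinyLSTM(2 * HID, HID, rng))

# Final per-pixel representation: fused 2D context + photometric features.
final = [[fused[y][x] + rgb_feat[y][x] for x in range(W)] for y in range(H)]
```

After the horizontal pass, each pixel's vector depends on every other position in the grid, which is the sense in which the fused contexts become 2D. In a trained network these per-pixel vectors would feed a classifier over the label categories; that step is omitted here.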

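This paper develops a novel LSTM-CF model that captures and fuses contextual information from multiple channels of photometric and depth data, and incorporates this model into deep convolutional neural networks (CNNs) for end-to-end training, in order to improve the accuracy of fine-grained semantic labeling.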

LSTM-CF: Unifying Context Modeling and Fusion with LSTMs for RGB-D Scene Labeling