This paper proposes a new deep convolutional neural network (DCNN) architecture for learning semantic segmentation. The main idea is to train the DCNN to produce internal representations that respect object boundaries. That is, for any two pixels on the same object, the DCNN is trained to produce nearly identical internal representations; conversely, the DCNN is trained to produce dissimilar representations for pixels coming from different objects. This strategy is complementary to many others pursued in semantic segmentation, making its integration with existing systems straightforward. Experimental results show that when this approach is combined with a pre-trained state-of-the-art segmentation system, per-pixel classification accuracy improves, and the resulting segmentations are qualitatively sharper. When combined with a dense conditional random field, this approach exceeds the prior state of the art on the PASCAL VOC2012 segmentation task. Further experiments show that the internal representations learned by the network make state-of-the-art features for patch-based stereo correspondence and motion tracking.

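This paper proposes a deep convolutional neural network based on pixel embeddings. The embeddings are used to learn distances between pixels, from which the network infers whether two pixels belong to the same region. The paper shows that combining this approach with a DCNN can significantly improve per-pixel classification accuracy.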
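The training objective described above (nearby embeddings for pixels on the same object, distant embeddings for pixels on different objects) can be illustrated with a small sketch. This is not the paper's loss, which is not given in this text; it is a generic contrastive (hinge) loss over pixel pairs, written in NumPy. The function name `pairwise_embedding_loss`, the `margin` parameter, and the way pairs are supplied are all assumptions made for illustration.

```python
import numpy as np


def pairwise_embedding_loss(embeddings, labels, pairs, margin=1.0):
    """Generic contrastive loss over pixel pairs (illustrative assumption,
    not the loss used in the paper).

    embeddings: (H, W, D) float array of per-pixel embedding vectors.
    labels:     (H, W) int array of object ids.
    pairs:      (N, 4) int array with rows (y1, x1, y2, x2).
    margin:     hypothetical distance that different-object pairs
                should exceed before they stop being penalized.
    """
    y1, x1, y2, x2 = pairs.T
    # Euclidean distance between the two embeddings of each pair.
    dist = np.linalg.norm(embeddings[y1, x1] - embeddings[y2, x2], axis=1)
    same = labels[y1, x1] == labels[y2, x2]
    # Same object: penalize any distance.
    # Different objects: penalize only distances smaller than the margin.
    per_pair = np.where(same, dist, np.maximum(0.0, margin - dist))
    return float(per_pair.mean())


# Toy 2x2 image: the top row is object 0, the bottom row is object 1.
labels = np.array([[0, 0], [1, 1]])
pairs = np.array([[0, 0, 0, 1],   # two pixels on the same object
                  [0, 0, 1, 0]])  # two pixels on different objects

# Embeddings that separate the two objects incur no loss.
good = np.zeros((2, 2, 2))
good[1, :, 0] = 3.0
good_loss = pairwise_embedding_loss(good, labels, pairs)

# Embeddings that collapse every pixel to one point are penalized.
bad = np.zeros((2, 2, 2))
bad_loss = pairwise_embedding_loss(bad, labels, pairs)
```

In this toy case `good_loss` is 0.0 and `bad_loss` is 0.5, since only the different-object pair violates the margin. In a real system the embeddings would come from a network and the loss would be minimized by gradient descent; the sketch only evaluates the objective on fixed arrays.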

Learning Dense Convolutional Embeddings for Semantic Segmentation