Long-range dependencies can capture useful contextual information to benefit visual understanding problems. In this work, we propose a Criss-Cross Network (CCNet) for obtaining such important information in a more effective and efficient way. Concretely, for each pixel, our CCNet can harvest the contextual information of its surrounding pixels on the criss-cross path through a novel criss-cross attention module. By taking a further recurrent operation, each pixel can finally capture the long-range dependencies from all pixels. Overall, our CCNet has the following merits: 1) GPU memory friendly. Compared with the non-local block, the recurrent criss-cross attention module requires $11\times$ less GPU memory. 2) High computational efficiency. The recurrent criss-cross attention reduces FLOPs by about 85\% compared with the non-local block when computing long-range dependencies. 3) State-of-the-art performance. We conduct extensive experiments on the popular semantic segmentation benchmarks Cityscapes and ADE20K, and on the instance segmentation benchmark COCO. In particular, our CCNet achieves mIoU scores of 81.4 and 45.22 on the Cityscapes test set and the ADE20K validation set, respectively, which are new state-of-the-art results. We make the code publicly available at \url{https://github.com/speedinghzl/CCNet}.

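The paper proposes a Criss-Cross Network (CCNet) to capture contextual information in images. With a novel criss-cross attention module, each pixel can collect contextual information from all pixels on its criss-cross path, and a recurrent operation lets every pixel eventually capture dependencies from the entire image. A category consistent loss is also proposed to encourage the module to produce more discriminative features. CCNet has the following merits: 1) GPU memory friendly. Compared with the non-local block, the proposed recurrent criss-cross attention module requires $11\times$ less GPU memory. 2) High computational efficiency. The recurrent criss-cross attention reduces FLOPs by about 85%. 3) State-of-the-art performance. Extensive experiments are conducted on the semantic segmentation benchmarks Cityscapes and ADE20K, the human parsing benchmark LIP, the instance segmentation benchmark COCO, and the video segmentation benchmark CamVid. In particular, CCNet achieves a new state-of-the-art mIoU of 81.9% on the Cityscapes test set, and 45.76% and 55.47% on the ADE20K and LIP validation sets, respectively.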
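
The claim that a second (recurrent) pass lets every pixel reach the whole image can be checked on a toy grid. The sketch below is only an illustration of the connectivity pattern, not the authors' implementation: it uses a 0/1 mask in place of learned attention weights, and the helper names `criss_cross_mask`, `reachable_after`, and `coverage`, as well as the 4×5 grid size, are assumptions made for this example.

```python
def criss_cross_mask(h, w):
    """Return an (h*w) x (h*w) 0/1 matrix: entry [p][q] is 1 when pixel q
    lies in the same row or the same column as pixel p (p itself included)."""
    n = h * w
    mask = [[0] * n for _ in range(n)]
    for p in range(n):
        pr, pc = divmod(p, w)
        for q in range(n):
            qr, qc = divmod(q, w)
            if pr == qr or pc == qc:
                mask[p][q] = 1
    return mask


def reachable_after(mask, steps):
    """Which pixels can pass information to which after `steps` passes."""
    n = len(mask)
    reach = [[i == j for j in range(n)] for i in range(n)]
    for _ in range(steps):
        reach = [
            [any(reach[i][k] and mask[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]
    return reach


def coverage(reach):
    """Fraction of (source, target) pixel pairs that are connected."""
    n = len(reach)
    return sum(sum(row) for row in reach) / (n * n)


H, W = 4, 5  # hypothetical toy feature-map size
mask = criss_cross_mask(H, W)
one_pass = reachable_after(mask, 1)
two_pass = reachable_after(mask, 2)

# Positions attended per pixel in a single criss-cross pass: H + W - 1.
per_pixel = sum(mask[0])

# Pairwise weights stored: dense (non-local style) vs. criss-cross.
dense_weights = (H * W) ** 2
criss_cross_weights = (H * W) * (H + W - 1)

print(per_pixel, coverage(one_pass), coverage(two_pass))
print(dense_weights, criss_cross_weights)
```

After one pass each pixel is connected only to its own row and column; after two passes any pixel (r1, c1) reaches any other pixel (r2, c2) through the intermediate pixel (r1, c2), so coverage becomes complete. The weight counts show why the per-pass cost grows with H + W rather than H × W, though this toy count is not meant to reproduce the memory and FLOPs figures reported above.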

CCNet: Criss-Cross Attention for Semantic Segmentation