Most recent top-performing semantic segmentation approaches are based on vision Transformers, particularly DETR-like frameworks, which employ a set of queries in the Transformer decoder. Each query is composed of a content query that preserves semantic information and a positional query that provides positional guidance for aggregating the query-specific context. However, the positional queries in the Transformer decoder layers are typically represented as fixed learnable weights, which often encode dataset statistics for segments and can be inaccurate for individual samples. Therefore, in this paper, we propose to generate positional queries dynamically, conditioned on the cross-attention scores and the localization information of the preceding layer. By doing so, each query is aware of its previous focus, thus providing more accurate positional guidance and encouraging cross-attention consistency across the decoder layers. In addition, we propose an efficient way to deal with high-resolution cross-attention by dynamically determining the contextual tokens based on the low-resolution cross-attention maps to perform local relation aggregation. Our overall framework, termed FASeg (Focus-Aware semantic Segmentation), provides a simple yet effective solution for semantic segmentation. Extensive experiments on ADE20K and Cityscapes show that FASeg achieves state-of-the-art performance, e.g., obtaining 48.3% and 49.6% mIoU for single-scale inference on the ADE20K validation set with ResNet-50 and Swin-T backbones, respectively, with only a marginal increase in computation over Mask2Former. Source code will be made publicly available at https://github.com/zip-group/FASeg.

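This paper proposes a query design called DFPQ (Dynamic Focus-aware Positional Queries), which dynamically generates positional queries from the cross-attention scores of the preceding decoder block and the positional encodings of the corresponding image features. It also selects contextual tokens based only on the low-resolution cross-attention scores and uses them to perform local relation aggregation. Experiments on ADE20K and Cityscapes show that the method achieves state-of-the-art performance on top of Mask2Former: with ResNet-50, Swin-T, and Swin-B backbones, its single-scale mIoU on the ADE20K validation set exceeds that of Mask2Former by 1.1%, 1.9%, and 1.1%, respectively.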
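
To make the idea of a focus-aware positional query concrete, here is a minimal sketch in plain Python. It assumes one plausible reading of the description: each query's positional query is the softmax-weighted average of the image-feature positional encodings, with weights taken from the preceding layer's cross-attention scores. The function names, the use of softmax, and the toy shapes are illustrative assumptions, not the FASeg implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_positional_queries(attn_scores, pos_encodings):
    """Hypothetical helper: build one positional query per object query.

    attn_scores:   N x HW raw cross-attention scores from the preceding
                   decoder layer (one row per query, one column per token).
    pos_encodings: HW x C positional encodings of the image features.
    Returns an N x C list: each row is the attention-weighted average of
    the positional encodings, i.e. where that query previously focused.
    """
    dim = len(pos_encodings[0])
    queries = []
    for scores in attn_scores:
        weights = softmax(scores)  # assumption: scores are normalized per query
        query = [
            sum(w * enc[c] for w, enc in zip(weights, pos_encodings))
            for c in range(dim)
        ]
        queries.append(query)
    return queries


# Toy example: 2 queries attending over 3 image tokens with 2-D encodings.
toy_pos = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_scores = [
    [0.0, 0.0, 0.0],   # no clear focus: result is the mean encoding
    [50.0, 0.0, 0.0],  # strong focus on token 0: result is close to its encoding
]
toy_queries = dynamic_positional_queries(toy_scores, toy_pos)
```

In this sketch a query with a sharp focus inherits the position of the token it attended to, while a query with diffuse attention receives a blend of positions. A fixed learnable positional query cannot make this distinction, since it is the same for every input image.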

Dynamic Focus-aware Positional Queries for Semantic Segmentation