Vision transformers (ViT) usually extract features by forwarding all tokens through every self-attention layer, from the first to the last. In this paper, we introduce dynamic token-pass vision transformers (DoViT) for semantic segmentation, which adaptively reduce the inference cost for images of varying complexity. DoViT progressively withdraws easy tokens from the self-attention computation and keeps forwarding the hard tokens until they meet the stopping criterion. We employ lightweight auxiliary heads to make the token-pass decision and divide the tokens into kept and stopped groups. By computing the two groups separately, the self-attention layers are sped up by operating on sparse tokens while remaining hardware-friendly. A token reconstruction module gathers the grouped tokens and restores them to their original positions in the sequence, which is necessary for predicting correct semantic masks. We conduct extensive experiments on two common semantic segmentation tasks and demonstrate that our method reduces FLOPs by about 40% $\sim$ 60% while keeping the drop in mIoU within 0.8% for various segmentation transformers. On Cityscapes, the throughput and inference speed of ViT-L/B increase to more than 2$\times$ their original values.
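To make the token-pass idea concrete, the following is a minimal NumPy sketch under stated assumptions, not the implementation used in the paper. The stopping rule (a token stops once the auxiliary head's maximum class probability reaches `threshold`), the identity-projection attention, and all names (`token_pass_forward`, `aux_heads`, `threshold`) are hypothetical choices made only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Single-head attention with identity Q/K/V projections and a residual
    # connection; a stand-in for a real transformer layer.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    return x + softmax(scores) @ x

def token_pass_forward(tokens, aux_heads, threshold=0.95):
    """tokens: (N, D) array; aux_heads: list of (D, C) matrices, one per layer.

    Assumed stopping rule: a token stops once its auxiliary-head confidence
    reaches `threshold`. Returns the reconstructed (N, D) sequence and the
    number of tokens that entered each executed layer.
    """
    out = tokens.copy()                    # full-length buffer for reconstruction
    active = np.arange(tokens.shape[0])    # original positions still forwarding
    kept_per_layer = []
    for w_aux in aux_heads:
        if active.size == 0:               # every token has already stopped
            break
        kept_per_layer.append(int(active.size))
        x = self_attention(out[active])    # attention over the kept tokens only
        out[active] = x                    # write back to original positions
        conf = softmax(x @ w_aux).max(axis=-1)
        active = active[conf < threshold]  # low-confidence (hard) tokens continue
    return out, kept_per_layer
```

Because `active` stores original sequence positions, writing `out[active] = x` returns each processed token to its own slot, while stopped tokens keep the features they had when they stopped. This plays the role that the abstract assigns to the token reconstruction module, in a much simplified form.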

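We introduce dynamic token-pass vision transformers (DoViT) for semantic segmentation, which adaptively reduce the inference cost for images of varying complexity. DoViT gradually stops the self-attention computation for easy tokens and keeps hard tokens forwarding until the stopping criterion is met. Lightweight auxiliary heads make the token-pass decision and divide the tokens into kept and stopped parts. Computing the token groups separately speeds up the self-attention layers with sparse tokens while remaining hardware-friendly. A token reconstruction module gathers the grouped tokens and restores them to their original positions in the sequence, which is necessary for predicting correct semantic masks. Extensive experiments on two common semantic segmentation tasks show that the method reduces FLOPs by 40% to 60% across various segmentation transformers, keeps the mIoU drop within 0.8%, and raises the throughput and inference speed of ViT-L/B on Cityscapes by more than 2$\times$.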

Dynamic Token-Pass Transformers for Semantic Segmentation