Domain generalization (DG) remains a significant challenge for perception based on deep neural networks (DNN), where domain shifts occur due to lighting, weather, or geolocation changes. In this work, we propose VLTSeg to enhance domain generalization in semantic segmentation, where the network is solely trained on the source domain and evaluated on unseen target domains. Our method leverages the inherent semantic robustness of vision-language models. First, by substituting traditional vision-only backbones with pre-trained encoders from CLIP and EVA-CLIP as transfer learning setting we find that in the field of DG, vision-language pre-training significantly outperforms supervised and self-supervised vision pre-training. We thus propose a new vision-language approach for domain generalized segmentation, which improves the domain generalization SOTA by 7.6% mIoU when training on the synthetic GTA5 dataset. We further show the superior generalization capabilities of vision-language segmentation models by reaching 76.48% mIoU on the popular Cityscapes-to-ACDC benchmark, outperforming the previous SOTA approach by 6.9% mIoU on the test set at the time of writing. Additionally, our approach shows strong in-domain generalization capabilities indicated by 86.1% mIoU on the Cityscapes test set, resulting in a shared first place with the previous SOTA on the current leaderboard at the time of submission.

本研究提出了一种基于视觉-语言模型的视觉语义分割方法，通过在源领域进行训练并在未见目标领域进行评估，提高了领域通用性。实验证明，该方法在域通用分割中的性能优于传统的视觉训练方法，取得了7.6% mIoU的提升。同时，在主流数据集上取得了76.48% mIoU的性能，超过了此前最优方法6.9% mIoU的水平。还表明该方法在领域内具有强大的泛化能力，并在当前排行榜上与最优方法并列第一。

VLTSeg: 用于领域泛化语义分割的基于CLIP的视觉-语言表示简单转移