Large vision-language models (VLMs) like CLIP have demonstrated good zero-shot learning performance in the unsupervised domain adaptation task. Yet, most transfer approaches for VLMs focus on either the language or visual branches, overlooking the nuanced interplay between both modalities. In this work, we introduce a Unified Modality Separation (UniMoS) framework for unsupervised domain adaptation. Leveraging insights from modality gap studies, we craft a nimble modality separation network that distinctly disentangles CLIP's features into language-associated and vision-associated components. Our proposed Modality-Ensemble Training (MET) method fosters the exchange of modality-agnostic information while maintaining modality-specific nuances. We align features across domains using a modality discriminator. Comprehensive evaluations on three benchmarks reveal our approach sets a new state-of-the-art with minimal computational costs. Code: https://github.com/TL-UESTC/UniMoS
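The abstract does not specify the architecture, so the following is only a minimal sketch of the idea of separating a CLIP feature into language-associated and vision-associated components and then ensembling the two predictions. It is not the authors' implementation. Every name here (`separate_and_ensemble`, `w_lang`, `w_vis`, `w_cls`, `mix`, `temperature`) is hypothetical, the weights are random rather than trained, and both the MET training procedure and the modality discriminator are omitted.

```python
import math
import random


def linear(weights, x):
    """Apply a bias-free dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rng, rows, cols):
    """Stand-in for learned weights (assumption: untrained, Gaussian)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def separate_and_ensemble(feature, w_lang, w_vis, text_embeds, w_cls,
                          mix=0.5, temperature=0.01):
    """Split one image feature into two components and mix their predictions.

    Assumed design: two linear heads produce the language-associated and
    vision-associated components; the former is scored against class text
    embeddings, the latter by a linear classifier.
    """
    lang_part = linear(w_lang, feature)   # language-associated component
    vis_part = linear(w_vis, feature)     # vision-associated component
    lang_probs = softmax([cosine(lang_part, t) / temperature
                          for t in text_embeds])
    vis_probs = softmax(linear(w_cls, vis_part))
    # Weighted ensemble of the two modality-specific predictions.
    return [mix * l + (1.0 - mix) * v for l, v in zip(lang_probs, vis_probs)]


rng = random.Random(0)
dim, num_classes = 8, 3
feature = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # stand-in CLIP feature
probs = separate_and_ensemble(
    feature,
    w_lang=random_matrix(rng, dim, dim),
    w_vis=random_matrix(rng, dim, dim),
    text_embeds=random_matrix(rng, num_classes, dim),
    w_cls=random_matrix(rng, num_classes, dim),
)
print(probs)
```

With `mix=1.0` the output reduces to the language-side prediction alone, and with `mix=0.0` to the vision-side prediction, which makes the contribution of each component easy to inspect.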

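Large vision-language models (VLMs) such as CLIP show good zero-shot learning performance on unsupervised domain adaptation. To fully exploit the subtle interplay between language and vision, this paper introduces a Unified Modality Separation (UniMoS) framework for unsupervised domain adaptation. Drawing on insights from modality gap studies, we design a nimble modality separation network that explicitly disentangles CLIP's features into language-associated and vision-associated components. Our proposed Modality-Ensemble Training (MET) method promotes the exchange of modality-agnostic information while preserving modality-specific nuances, and a modality discriminator aligns features across domains. We evaluate the approach comprehensively on three benchmarks, and the results show that it sets a new state of the art at minimal computational cost.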

Split to Merge: Unifying Separated Modalities for Unsupervised Domain Adaptation