We propose a novel framework to address the challenging real-world task of Single Image Test Time Adaptation in an open and dynamic environment. We leverage large-scale Vision Language Models such as CLIP to enable real-time adaptation on a per-image basis, without access to source data or ground-truth labels. Because the deployed model can also encounter unseen classes in an open world, we first employ a simple and effective Out-of-Distribution (OOD) detection module to distinguish between weak and strong OOD samples. We then propose a novel contrastive-learning-based objective that enhances the discriminability between weak and strong OOD samples by using small, dynamically updated feature banks. Finally, we employ a classification objective to adapt the model using the reliable weak OOD samples. The proposed framework, ROSITA, combines these components to enable continuous online adaptation of Vision Language Models on a single-image basis. Extensive experiments on diverse domain adaptation benchmarks validate the effectiveness of the proposed framework. Our code is available at the project site: https://manogna-s.github.io/rosita/
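
The per-image flow described above (score each incoming image, route it as weak or strong OOD, and keep small, dynamically updated feature banks) can be illustrated with a minimal sketch. This is not the ROSITA implementation: the class name `SingleImageRouter`, the fixed `threshold`, the use of maximum cosine similarity as the OOD score, and the toy 2-D features are all assumptions made for illustration. The contrastive and classification objectives are omitted; only the routing and bank bookkeeping are shown.

```python
import math
from collections import deque


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


class SingleImageRouter:
    """Hypothetical helper: routes one image feature at a time to a bank."""

    def __init__(self, class_text_features, threshold=0.5, bank_size=4):
        # One text feature per known class (e.g. from a CLIP-style text encoder).
        self.class_text_features = class_text_features
        # Assumption: a fixed threshold stands in for the OOD detection module.
        self.threshold = threshold
        # Small FIFO banks: the oldest entry is dropped once a bank is full.
        self.weak_bank = deque(maxlen=bank_size)
        self.strong_bank = deque(maxlen=bank_size)

    def step(self, image_feature):
        """Process a single test image feature; return (kind, prediction, score)."""
        sims = [cosine(image_feature, t) for t in self.class_text_features]
        score = max(sims)  # assumed OOD score: best match to any known class
        if score >= self.threshold:
            # Treated as weak OOD: a known class, possibly under domain shift.
            self.weak_bank.append(image_feature)
            return "weak", sims.index(score), score
        # Treated as strong OOD: likely an unseen class, so no label is assigned.
        self.strong_bank.append(image_feature)
        return "strong", None, score


router = SingleImageRouter([[1.0, 0.0], [0.0, 1.0]], threshold=0.5, bank_size=2)
print(router.step([0.9, 0.1]))    # close to class 0, routed to the weak bank
print(router.step([-1.0, -1.0]))  # far from both classes, routed to the strong bank
```

In a fuller version, the features held in the two banks would supply the positives and negatives for a contrastive loss, and only the samples routed as weak OOD would contribute to the classification loss.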

Effectiveness of Vision Language Models for Open-World Single Image Test Time Adaptation