The zero-shot capabilities of Vision-Language Models (VLMs) have been widely leveraged to improve predictive performance. However, previous work on transductive or test-time adaptation (TTA) often makes strong assumptions about the data distribution, such as the presence of all classes. Our work challenges these favorable deployment scenarios and introduces a more realistic evaluation framework, including: (i) a variable number of effective classes for adaptation within a single batch, and (ii) non-i.i.d. batches of test samples in online adaptation settings. We provide comprehensive evaluations, comparisons, and ablation studies demonstrating that current transductive or TTA methods for VLMs systematically compromise the models' initial zero-shot robustness across various realistic scenarios, favoring performance gains under advantageous assumptions about the test samples' distributions. Furthermore, we introduce StatA, a versatile method that can handle a wide range of deployment scenarios, including those with a variable number of effective classes at test time. Our approach incorporates a novel regularization term designed specifically for VLMs, which acts as a statistical anchor preserving the initial text-encoder knowledge, particularly in low-data regimes. Code is available at https://github.com/MaxZanella/StatA.
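
To make scenario (i) concrete, the sketch below shows one possible way to draw a test batch that contains only a few effective classes. It is an illustration under stated assumptions, not the evaluation protocol of the StatA repository: the function name `sample_batch`, its parameters, and the synthetic labels are all hypothetical.

```python
import random


def sample_batch(labels, batch_size, num_effective, rng):
    """Return sample indices for one batch with at most `num_effective` classes.

    Hypothetical helper for illustration only; not taken from the StatA code.
    """
    classes = sorted(set(labels))
    if num_effective > len(classes):
        raise ValueError("num_effective exceeds the number of available classes")
    # Choose which classes are present in this batch.
    present = set(rng.sample(classes, num_effective))
    # Collect the indices of all samples belonging to the chosen classes.
    pool = [i for i, y in enumerate(labels) if y in present]
    # Sample without replacement when the pool is large enough.
    if len(pool) >= batch_size:
        return rng.sample(pool, batch_size)
    return rng.choices(pool, k=batch_size)


rng = random.Random(1)
labels = [rng.randrange(20) for _ in range(2000)]  # synthetic labels, 20 classes
batch = sample_batch(labels, batch_size=32, num_effective=3, rng=rng)
```

Varying `num_effective` from one batch to the next would mimic a deployment in which the set of classes present is not known in advance. Concatenating batches drawn this way would also yield a class-correlated stream, which is one simple way to approximate the non-i.i.d. setting in (ii).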

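This study examines the strong assumptions made in test-time adaptation of vision-language models (VLMs). We propose a new method, StatA, that handles adaptation scenarios with a variable number of effective classes and introduces a VLM-specific regularization term to better preserve the initial text-encoder knowledge. The method shows improved adaptation across a range of realistic scenarios, and the study demonstrates that existing methods, which assume a favorable test-sample distribution, tend to weaken the model's zero-shot robustness.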

Realistic Test-Time Adaptation of Vision-Language Models