Misalignment between the outputs of a vision-language (VL) model and the task goal hinders its deployment. This issue can worsen when there are distribution shifts between the training and test data. To address this problem, prevailing fully test-time adaptation (TTA) methods bootstrap themselves through entropy minimization. However, minimizing the entropy of the predictions makes the model overfit to its own incorrect output distributions. In this work, we propose TTA with feedback to avoid such overfitting and align the model with task goals. Specifically, we adopt CLIP as the reward model to provide feedback for VL models at test time in various tasks, including image classification, image-text retrieval, and image captioning. Given a single test sample, the model aims to maximize the CLIP reward through reinforcement learning. We adopt a reward design that uses the average CLIP score of the sampled candidates as the baseline. This design is simple and surprisingly effective when combined with various task-specific sampling strategies. The entire system is flexible, allowing the reward model to be extended with multiple CLIP models. In addition, a momentum buffer can be used to memorize and leverage the knowledge learned from multiple test samples. Extensive experiments demonstrate that our method significantly improves different VL models after TTA.
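
The reward design described above can be illustrated with a small sketch. The abstract does not give the exact formulation, so the following assumes a standard REINFORCE-style objective in which each sampled candidate's reward is its CLIP score minus the mean CLIP score of all sampled candidates. The function names, the logits, and the CLIP scores are hypothetical values chosen for illustration; no real CLIP model is called.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def baseline_rewards(clip_scores):
    """Reward of each candidate: its CLIP score minus the mean score
    of all sampled candidates (the baseline)."""
    baseline = sum(clip_scores) / len(clip_scores)
    return [s - baseline for s in clip_scores]


def reinforce_loss(log_probs, rewards):
    """Assumed REINFORCE-style loss: -mean(reward * log-probability).
    Minimizing it raises the probability of above-average candidates."""
    weighted = [r * lp for r, lp in zip(rewards, log_probs)]
    return -sum(weighted) / len(weighted)


# Hypothetical model logits for three sampled candidates of one test sample.
logits = [2.0, 1.0, 0.5]
# Hypothetical CLIP scores of the same three candidates.
clip_scores = [0.30, 0.20, 0.25]

probs = softmax(logits)
log_probs = [math.log(p) for p in probs]
rewards = baseline_rewards(clip_scores)
loss = reinforce_loss(log_probs, rewards)

print("rewards:", rewards)
print("loss:", loss)
```

Because the baseline is the mean of the sampled scores, the rewards sum to zero: candidates that CLIP scores above average are reinforced and those below average are suppressed, regardless of how confident the model already is in them. This is the property that distinguishes the feedback signal from entropy minimization, which only sharpens the model's existing predictions.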

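We propose a test-time feedback method to address the mismatch between the outputs of vision-language models and task goals, preventing the model from overfitting to its own incorrect output distribution. Specifically, CLIP is adopted as the reward model across different tasks, including image classification, image-text retrieval, and image captioning. Through reinforcement learning, the model is adapted on a single test sample with the objective of maximizing the CLIP reward. Extensive experiments show that this test-time feedback method can significantly improve the results of different vision-language models.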

Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models