Existing vision-language models exhibit strong generalization across a variety of visual domains and tasks. However, such models mainly perform zero-shot recognition in a closed-set manner, and thus struggle by design to handle open-domain visual concepts. Recent finetuning methods, such as prompt learning, not only study the discrimination between in-distribution (ID) and out-of-distribution (OOD) samples, but also show some improvements in both ID and OOD accuracies. In this paper, we first demonstrate that vision-language models, when finetuned for long enough without proper regularization, tend to overfit the known classes in the given dataset, which degrades performance on unknown classes. We then propose a novel approach, OGEN, to address this pitfall, with the main focus on improving the OOD GENeralization of finetuned models. Specifically, a class-conditional feature generator is introduced to synthesize OOD features using just the class name of any unknown class. Such synthesized features provide useful knowledge about unknowns and, when optimized jointly, help regularize the decision boundary between ID and OOD data. Equally important is our adaptive self-distillation mechanism, which regularizes the feature generation model during joint optimization by adaptively transferring knowledge between model states to further prevent overfitting. Experiments validate that our method yields convincing gains in OOD generalization performance in different settings.

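In this paper, we first show that vision-language models, when finetuned for long enough without proper regularization, tend to overfit the known classes in the given dataset, leading to degraded performance on unknown classes. We then propose a novel method, OGEN, to address this problem, focusing mainly on improving the generalization of finetuned models to unknown (OOD) classes. Specifically, we introduce a class-conditional feature generator that synthesizes OOD features using only the class name of any unknown class. These synthesized features provide useful knowledge about unknown classes and, under joint optimization, help regularize the decision boundary between ID and OOD data. Equally important, our adaptive self-distillation mechanism regularizes the feature generation model by adaptively transferring knowledge between model states during joint optimization, further preventing overfitting. Experiments confirm that our method delivers convincing gains in OOD generalization performance across different settings.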
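The two ingredients named above can be illustrated with a toy sketch. The abstract does not specify how the feature generator or the self-distillation is implemented, so everything here is an assumption made for illustration: the hypothetical `synthesize_feature` builds a feature for an unknown class as a similarity-weighted mix of known-class image features, and the hypothetical `ema_update` blends two model states with a fixed momentum as a simple stand-in for transferring knowledge between states. Neither is claimed to be the OGEN method itself.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def synthesize_feature(unknown_text_emb, known_text_embs, known_image_feats,
                       temperature=1.0):
    """Hypothetical class-conditional generator (an assumption, not OGEN).

    The unknown class is represented only by the text embedding of its
    class name. Its synthesized feature is a convex combination of
    known-class image features, weighted by text-embedding similarity.
    """
    sims = [dot(unknown_text_emb, t) / temperature for t in known_text_embs]
    weights = softmax(sims)
    dim = len(known_image_feats[0])
    return [sum(w * f[d] for w, f in zip(weights, known_image_feats))
            for d in range(dim)]


def ema_update(teacher, student, momentum):
    """Hypothetical stand-in for knowledge transfer between model states.

    Blends an earlier state (teacher) with the current state (student).
    The momentum is fixed here; an adaptive schedule would vary it.
    """
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher, student)]


if __name__ == "__main__":
    known_text = [[1.0, 0.0], [0.0, 1.0]]   # toy class-name embeddings
    known_img = [[2.0, 0.0], [0.0, 2.0]]    # toy known-class image features
    print(synthesize_feature([1.0, 1.0], known_text, known_img))
    print(ema_update([1.0, 0.0], [0.0, 1.0], momentum=0.5))
```

In this toy setup, an unknown class whose name embedding is equally similar to both known classes receives the average of their features, and lowering `temperature` pushes the synthesized feature toward the most similar known class.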

Overcoming the Pitfalls of Vision-Language Model Finetuning for OOD Generalization