The zero-shot performance of existing vision-language models (VLMs) such as CLIP is limited by the availability of large-scale, aligned image and text datasets in specific domains. In this work, we leverage two complementary sources of information -- descriptions of categories generated by large language models (LLMs) and abundant, fine-grained image classification datasets -- to improve the zero-shot classification performance of VLMs across fine-grained domains. On the technical side, we develop methods to train VLMs with this "bag-level" image-text supervision. We find that simply using these attributes at test-time does not improve performance, but our training strategy, for example, on the iNaturalist dataset, leads to an average improvement of 4-5% in zero-shot classification accuracy for novel categories of birds and flowers. Similar improvements are observed in domains where a subset of the categories was used to fine-tune the model. By prompting LLMs in various ways, we generate descriptions that capture visual appearance, habitat, and geographic regions and pair them with existing attributes such as the taxonomic structure of the categories. We systematically evaluate their ability to improve zero-shot categorization in natural domains. Our findings suggest that geographic priors can be just as effective and are complementary to visual appearance. Our method also outperforms prior work on prompt-based tuning of VLMs. We plan to release the benchmark, consisting of 7 datasets, which will contribute to future research in zero-shot recognition.

通过使用大型语言模型（LLMs）生成的类别描述和丰富的细粒度图像分类数据集，我们提出了一种方法来改善视觉-语言模型（VLMs）在细粒度领域的零样本分类性能。通过在训练过程中利用图像-文本监督，我们的方法在鸟类和花卉等新颖类别的零样本分类准确度上平均提高了4-5％。地理先验也被证明对于改善零样本分类同样有效，与视觉特征互补。我们计划发布包含7个数据集的基准测试，以促进未来的零样本识别研究。

通过使用文本描述使VLMs适应性更好的零射分类改进