Fine-grained image classification, particularly in zero/few-shot scenarios, presents a significant challenge for vision-language models (VLMs), such as CLIP. These models often struggle with the nuanced task of distinguishing between semantically similar classes due to limitations in their pre-trained recipe, which lacks supervision signals for fine-grained categorization. This paper introduces CascadeVLM, an innovative framework that overcomes the constraints of previous CLIP-based methods by effectively leveraging the granular knowledge encapsulated within large vision-language models (LVLMs). Experiments across various fine-grained image datasets demonstrate that CascadeVLM significantly outperforms existing models, specifically on the Stanford Cars dataset, achieving an impressive 85.6% zero-shot accuracy. Performance gain analysis validates that LVLMs produce more accurate predictions for challenging images that CLIPs are uncertain about, bringing the overall accuracy boost. Our framework sheds light on a holistic integration of VLMs and LVLMs for effective and efficient fine-grained image classification.

本研究介绍了CascadeVLM，一种创新的框架，通过有效地利用大型视觉-语言模型（LVLMs）内固有的精细知识，克服了以前基于CLIP的方法的限制。在各种细粒度图像数据集上的实验表明，CascadeVLM在Stanford Cars数据集上显著优于现有模型，达到了令人印象深刻的85.6%的零样本准确性。性能增益分析验证了LVLM对于CLIP不确定的复杂图像的更准确预测，从而提高了整体准确性。我们的框架为有效和高效的细粒度图像分类提供了VLM与LVLM的整体集成方法。

通过级联视觉语言模型提升细粒度图像分类