Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification. Despite this success, most traditional VLM-based methods assume either partial source supervision or an ideal vocabulary, and these assumptions rarely hold in open-world scenarios. In this paper, we target a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotations and only a broad vocabulary. To address this challenge, we propose the Self Structural Semantic Alignment (S^3A) framework, which extracts structural semantic information from unlabeled data while simultaneously self-learning. S^3A adopts a Cluster-Vote-Prompt-Realign (CVPR) algorithm, which iteratively groups unlabeled data to derive structural semantics for pseudo-supervision. The CVPR process consists of four steps: iteratively clustering the images, voting within each cluster to identify initial class candidates from the vocabulary, generating discriminative prompts with large language models to distinguish confusable candidates, and realigning images with the vocabulary to obtain the structural semantic alignment. Finally, we self-train the CLIP image encoder with both individual and structural semantic alignment through a teacher-student learning strategy. Comprehensive experiments across generic and fine-grained benchmarks demonstrate that S^3A offers substantial improvements over existing VLM-based approaches, improving accuracy over CLIP by more than 15% on average. Our code, models, and prompts are publicly released at https://github.com/sheng-eatamath/S3A.
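The voting step of the CVPR process can be illustrated with a minimal sketch. This is not the released implementation: the function names `cosine` and `vote_candidates`, the `top_k` parameter, the one-vote-per-image rule, and the toy 2-D embeddings (standing in for CLIP image and text features) are all assumptions made for illustration, and the cluster assignments are taken as given rather than computed.

```python
from collections import Counter, defaultdict
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def vote_candidates(image_embs, cluster_ids, vocab_embs, vocab, top_k=2):
    """For each cluster, return up to top_k vocabulary words ranked by votes."""
    votes = defaultdict(Counter)
    for emb, cid in zip(image_embs, cluster_ids):
        # Assumption: each image casts one vote for its most similar word.
        best = max(range(len(vocab)), key=lambda j: cosine(emb, vocab_embs[j]))
        votes[cid][vocab[best]] += 1
    return {cid: [w for w, _ in c.most_common(top_k)] for cid, c in votes.items()}


# Hypothetical toy data: 2-D vectors in place of real CLIP embeddings.
vocab = ["cat", "dog", "car"]
vocab_embs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
image_embs = [[0.99, 0.05], [0.95, 0.10], [0.85, 0.50], [0.05, 1.00], [0.10, 0.90]]
cluster_ids = [0, 0, 0, 1, 1]  # assumed output of a prior clustering step

candidates = vote_candidates(image_embs, cluster_ids, vocab_embs, vocab)
print(candidates)
```

In this toy example, cluster 0 yields two candidates ("cat" and "dog"), which is the kind of confusable pair that the subsequent prompt-generation step is meant to separate, while cluster 1 yields the single candidate "car".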

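We propose the Self Structural Semantic Alignment (S^3A) framework, which extracts structural semantic information from unlabeled data while self-learning, and thereby removes the assumption of partial source supervision or an ideal vocabulary that restricts traditional methods based on large-scale pre-trained vision language models. Its Cluster-Vote-Prompt-Realign algorithm clusters the data iteratively, uses large language models to generate discriminative prompts that distinguish confusable class candidates, and self-learns through a teacher-student strategy, addressing the challenge of Realistic Zero-Shot Classification. Multiple experiments show that the method clearly outperforms existing VLM-based approaches, improving accuracy over CLIP by more than 15% on average.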

Realistic Zero-Shot Classification via Self Structural Semantic Alignment