Recent advances in foundation models present new opportunities for interpretable visual recognition: one can first query Large Language Models (LLMs) to obtain a set of attributes that describe each class, then apply vision-language models to classify images via these attributes. Pioneering work shows that querying thousands of attributes can achieve performance competitive with image features. However, our further investigation across 8 datasets reveals that large sets of LLM-generated attributes perform almost the same as random words. This surprising finding suggests that these attributes may contain significant noise. We hypothesize that there exist much smaller subsets of attributes that maintain the classification performance, and propose a novel learning-to-search method to discover those concise sets of attributes. As a result, on the CUB dataset, our method achieves performance close to that of massive LLM-generated attributes (e.g., 10k attributes for CUB), yet uses only 32 attributes in total to distinguish 200 bird species. Furthermore, our new paradigm demonstrates several additional benefits: higher interpretability and interactivity for humans, and the ability to summarize knowledge for a recognition task.
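
To make the attribute-based pipeline described above concrete, the following is a minimal sketch of classifying an image through attribute scores. It rests on several assumptions that are not taken from the paper: the embeddings are random stand-ins for the outputs of a vision-language model's image and text encoders, the linear map from attribute scores to class logits is an illustrative choice, and the function names are hypothetical. It does not implement the learning-to-search method itself.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def attribute_scores(image_emb, attr_embs):
    """Cosine similarity between one image embedding and each attribute embedding."""
    img = normalize(image_emb)
    return [sum(a * b for a, b in zip(img, normalize(att))) for att in attr_embs]


def classify(image_emb, attr_embs, class_weights):
    """Predict a class index from attribute scores via a linear layer (an assumption)."""
    scores = attribute_scores(image_emb, attr_embs)
    logits = [sum(w * s for w, s in zip(row, scores)) for row in class_weights]
    return max(range(len(logits)), key=logits.__getitem__)


# Random stand-ins for encoder outputs; a real setup would embed images and
# attribute phrases with a vision-language model instead.
random.seed(0)
DIM, NUM_ATTRS, NUM_CLASSES = 16, 32, 200  # 32 attributes, 200 classes as in the CUB setting


def rand_vec(n):
    return [random.gauss(0.0, 1.0) for _ in range(n)]


image_emb = rand_vec(DIM)
attr_embs = [rand_vec(DIM) for _ in range(NUM_ATTRS)]
class_weights = [rand_vec(NUM_ATTRS) for _ in range(NUM_CLASSES)]
predicted = classify(image_emb, attr_embs, class_weights)
```

Because every prediction passes through a short vector of attribute scores, each decision can be inspected attribute by attribute, which is the kind of interpretability a concise attribute set is meant to support.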

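Recent advances in foundation models offer new opportunities for interpretable visual recognition: large language models are queried for attributes that describe each class, and vision-language models then classify images via those attributes. Our study finds that large sets of LLM-generated attributes perform almost no differently from random words. We propose a new learning-to-search method to discover concise attribute sets. On the CUB dataset, using only 32 attributes to distinguish 200 bird species, the method achieves performance close to that of massive LLM-generated attribute sets (e.g., 10,000 attributes for CUB). In addition, our new paradigm offers several further benefits: higher interpretability and interactivity for humans, and the ability to summarize knowledge.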

Learning Concise and Descriptive Attributes for Visual Recognition