The visual classification performance of vision-language models such as CLIP can benefit from additional semantic knowledge, e.g. via large language models (LLMs) such as GPT-3. Further extending classnames with LLM-generated class descriptors, e.g. ``waffle, \textit{which has a round shape}'', or averaging retrieval scores over multiple such descriptors, has been shown to improve generalization performance. In this work, we study this behavior in detail and propose \texttt{Waffle}CLIP, a framework for zero-shot visual classification which achieves similar performance gains on a large number of visual classification tasks by simply replacing LLM-generated descriptors with random character and word descriptors \textbf{without} querying external models. We extend these results with an extensive experimental study on the impact and shortcomings of additional semantics introduced via LLM-generated descriptors, and showcase how semantic context is better leveraged by automatically querying LLMs for high-level concepts, while jointly resolving potential class name ambiguities. Link to the codebase: https://github.com/ExplainableML/WaffleCLIP.

本文提出了一种名为 WaffleCLIP 的框架，通过简单地替换 LLM 生成的描述符为字符和词串，而无需查询外部模型，就在大量的视觉分类任务中实现类似的性能提升，并通过实验研究了 LLM 生成的描述符引入附加语义的影响和缺陷。

基于随机单词和广泛概念的视觉分类性能探讨