The excellent generative capabilities of text-to-image diffusion models suggest they learn informative representations of image-text data. However, what knowledge their representations capture is not fully understood, and they have not been thoroughly explored on downstream tasks. We investigate diffusion models by proposing a method for evaluating them as zero-shot classifiers. The key idea is using a diffusion model's ability to denoise a noised image given a text description of a label as a proxy for that label's likelihood. We apply our method to Imagen, using it to probe fine-grained aspects of Imagen's knowledge and comparing it with CLIP's zero-shot abilities. Imagen performs competitively with CLIP on a wide range of zero-shot image classification datasets. Additionally, it achieves state-of-the-art results on shape/texture bias tests and can successfully perform attribute binding while CLIP cannot. Although generative pre-training is prevalent in NLP, visual foundation models often use other methods such as contrastive learning. Based on our findings, we argue that generative pre-training should be explored as a compelling alternative for vision and vision-language problems.
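The key idea above, scoring each label by how well the model denoises the image when conditioned on that label's text, can be sketched in a few lines. This is a minimal illustration and not the paper's implementation or Imagen's API: `toy_denoiser`, `PROTOTYPES`, the prompt strings, the noise schedule, and the plain mean-squared-error score are all assumptions chosen so that the example runs on its own.

```python
import numpy as np


def zero_shot_classify(image, prompts, denoise_fn, num_trials=32, seed=0):
    """Pick the prompt whose conditioning lets the denoiser best recover `image`."""
    rng = np.random.default_rng(seed)
    scores = np.zeros(len(prompts))
    for _ in range(num_trials):
        # Assumed schedule: variance-preserving mix of image and Gaussian noise.
        t = rng.uniform(0.1, 0.9)
        noise = rng.standard_normal(image.shape)
        noised = np.sqrt(1.0 - t) * image + np.sqrt(t) * noise
        for i, prompt in enumerate(prompts):
            # Every prompt sees the same noised image, so scores are comparable.
            pred = denoise_fn(noised, prompt, t)
            # Lower reconstruction error means a higher score for that label.
            scores[i] -= np.mean((pred - image) ** 2)
    scores /= num_trials
    return prompts[int(np.argmax(scores))], scores


# Hypothetical stand-in for a text-conditioned diffusion model: each prompt
# has a fixed "prototype" image that the denoiser pulls its estimate toward.
SIZE = 8
stripes = np.arange(SIZE) % 2
PROTOTYPES = {
    "a photo of horizontal stripes": np.tile(stripes[:, None], (1, SIZE)).astype(float),
    "a photo of vertical stripes": np.tile(stripes[None, :], (SIZE, 1)).astype(float),
}


def toy_denoiser(noised, prompt, t):
    """Blend the rescaled noisy input with the prompt's prototype (toy model only)."""
    rescaled = noised / np.sqrt(1.0 - t)  # undo the signal scaling
    return (1.0 - t) * rescaled + t * PROTOTYPES[prompt]


image = PROTOTYPES["a photo of horizontal stripes"]
label, scores = zero_shot_classify(image, list(PROTOTYPES), toy_denoiser)
print(label, scores)
```

In a real setting `denoise_fn` would wrap a pretrained text-to-image diffusion model, and averaging over several noise levels and noise samples reduces the variance of the per-label scores.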

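Using a diffusion model's denoising ability as a proxy for label likelihood, we evaluate Imagen as a zero-shot classifier, probe aspects of its knowledge, and compare it with CLIP. Imagen performs on par with CLIP on zero-shot image classification, achieves state-of-the-art results on shape/texture bias tests, and successfully performs attribute binding, which CLIP cannot. We therefore argue that generative pre-training should be explored as a compelling alternative for vision and vision-language problems.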

Text-to-Image Diffusion Models are Zero-Shot Classifiers