Diffusion models are the state of the art in text-to-image generation, but their perceptual variability remains understudied. In this paper, we examine how prompts affect image variability in black-box diffusion-based models. We propose W1KP, a human-calibrated measure of variability in a set of images, bootstrapped from existing image-pair perceptual distances. Current datasets do not cover recent diffusion models, so we curate three test sets for evaluation. Our best perceptual distance outperforms nine baselines by up to 18 points in accuracy, and our calibration matches graded human judgements 78% of the time. Using W1KP, we study prompt reusability and show that Imagen prompts can be reused for 10-50 random seeds before new images become too similar to already generated images, while Stable Diffusion XL and DALL-E 3 prompts can be reused 50-200 times. Lastly, we analyze 56 linguistic features of real prompts, finding that the prompt's length, CLIP embedding norm, concreteness, and word senses influence variability most. As far as we are aware, we are the first to analyze diffusion variability from a visuolinguistic perspective. Our project page is at http://w1kp.com.
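To make the idea of a set-level variability measure built from image-pair distances concrete, the following is a minimal sketch, not the W1KP implementation. It assumes each image has already been mapped to a feature vector, uses cosine distance as a stand-in for a perceptual distance, and averages over all pairs; the function names `cosine_distance` and `set_variability` are hypothetical, and the human calibration step described above is not modeled.

```python
import itertools

import numpy as np


def cosine_distance(a, b):
    """Stand-in pairwise distance (assumption: not the paper's perceptual distance)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def set_variability(embeddings, distance=cosine_distance):
    """Mean pairwise distance over a set of image feature vectors.

    Higher values mean the images generated for one prompt differ more
    from each other; 0 means they are identical under `distance`.
    """
    pairs = list(itertools.combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two images to measure variability")
    return float(np.mean([distance(a, b) for a, b in pairs]))


if __name__ == "__main__":
    # Toy feature vectors standing in for images generated from one prompt.
    images = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
    print(set_variability(images))
```

Any pairwise distance with the same two-argument signature can be passed as `distance`, which mirrors the idea of bootstrapping a set-level score from an existing image-pair metric.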

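Summary: Diffusion-based text-to-image generation is the current state of the art. This study examines image variability in black-box diffusion models through the effect of prompts, proposes W1KP, a human-calibrated measure of image variability, and evaluates it on recent diffusion models. The best perceptual distance outperforms nine baselines by up to 18 points in accuracy, and the calibration agrees with human judgements 78% of the time. Using W1KP, the study also examines prompt reusability, showing that Imagen prompts can be reused 10-50 times, while Stable Diffusion XL and DALL-E 3 prompts can be reused 50-200 times. Finally, an analysis of 56 linguistic features of real prompts finds that prompt length, CLIP embedding norm, concreteness, and word senses influence image variability. To the authors' knowledge, this is the first study to analyze diffusion variability from a visuolinguistic perspective. For details, see the project page: http://w1kp.com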

Words Worth a Thousand Pictures: Measuring and Understanding Perceptual Variability in Text-to-Image Generation