Recent progress in generative models has resulted in models that produce images that are both realistic and relevant for most textual inputs. These models are being used to generate millions of images every day, and hold the potential to drastically impact areas such as generative art, digital marketing, and data augmentation. Given their outsized impact, it is important to ensure that the generated content reflects the artifacts and surroundings across the globe, rather than over-representing certain parts of the world. In this paper, we measure the geographical representativeness of images of common nouns (e.g., a house) generated through DALL.E 2 and Stable Diffusion models using a crowdsourced study comprising 540 participants across 27 countries. For deliberately underspecified inputs without country names, the generated images most reflect the surroundings of the United States followed by India, and the top generations rarely reflect surroundings from all other countries (average score less than 3 out of 5). Specifying the country names in the input increases the representativeness by 1.44 points on average for DALL.E 2 and 0.75 for Stable Diffusion; however, the overall scores for many countries still remain low, highlighting the need for future models to be more geographically inclusive. Lastly, we examine the feasibility of quantifying the geographical representativeness of generated images without conducting user studies.

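In this paper, we use a crowdsourced study to measure how geographically representative the images generated for common nouns by the DALL.E 2 and Stable Diffusion models are. We find that, for inputs without a specific country name, the generated images best reflect the surroundings of the United States and India, while the surroundings of other countries are reflected less well. Specifying the country name in the input raises the score by 1.44 points for DALL.E 2 and by 0.75 points for Stable Diffusion, but the overall scores for many countries remain low, so future models need to be more geographically inclusive. Finally, we examine the feasibility of quantifying the geographical representativeness of generated images without conducting user studies.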

Inspecting the Geographical Representativeness of Images from Text-to-Image Models