While text-to-visual models now produce photo-realistic images and videos, they struggle with compositional text prompts involving attributes, relationships, and higher-order reasoning such as logic and comparison. In this work, we conduct an extensive human study on GenAI-Bench to evaluate the performance of leading image and video generation models in various aspects of compositional text-to-visual generation. We also compare automated evaluation metrics against our collected human ratings and find that VQAScore -- a metric measuring the likelihood that a VQA model views an image as accurately depicting the prompt -- significantly outperforms previous metrics such as CLIPScore. In addition, VQAScore can improve generation in a black-box manner (without finetuning) via simply ranking a few (3 to 9) candidate images. Ranking by VQAScore is 2x to 3x more effective than other scoring methods like PickScore, HPSv2, and ImageReward at improving human alignment ratings for DALL-E 3 and Stable Diffusion, especially on compositional prompts that require advanced visio-linguistic reasoning. We will release a new GenAI-Rank benchmark with over 40,000 human ratings to evaluate scoring metrics on ranking images generated from the same prompt. Lastly, we discuss promising areas for improvement in VQAScore, such as addressing fine-grained visual details. We will release all human ratings (over 80,000) to facilitate scientific benchmarking of both generative models and automated metrics.

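This paper conducts an extensive human study on GenAI-Bench to evaluate leading image and video generation models across various aspects of compositional text-to-visual generation. It finds that VQAScore agrees with human ratings significantly better than previous metrics such as CLIPScore. VQAScore can also improve generation in a black-box manner, simply by ranking a few (3 to 9) candidate images. Ranking by VQAScore is 2x to 3x more effective than other scoring methods such as PickScore, HPSv2, and ImageReward at improving human alignment ratings, especially on compositional prompts that require advanced visio-linguistic reasoning.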
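
The black-box ranking step described above (generate a few candidates, score each against the prompt, keep the highest-scoring one) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: `rank_candidates`, `pick_best`, and `toy_score` are hypothetical helpers, and `toy_score` is a simple word-overlap stand-in operating on caption strings, not VQAScore or any real image metric. In practice the scorer would be replaced by a function returning an image-text alignment score.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

# A candidate can be any object the scorer understands (an image, a path, ...).
Candidate = TypeVar("Candidate")


def rank_candidates(
    prompt: str,
    candidates: Sequence[Candidate],
    score_fn: Callable[[str, Candidate], float],
) -> List[Tuple[float, Candidate]]:
    """Score every candidate against the prompt and sort best-first."""
    scored = [(score_fn(prompt, cand), cand) for cand in candidates]
    # Sort on the score only, so candidates never need to be comparable.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def pick_best(
    prompt: str,
    candidates: Sequence[Candidate],
    score_fn: Callable[[str, Candidate], float],
) -> Candidate:
    """Return the single highest-scoring candidate (best-of-N selection)."""
    if not candidates:
        raise ValueError("need at least one candidate to rank")
    return rank_candidates(prompt, candidates, score_fn)[0][1]


def toy_score(prompt: str, caption: str) -> float:
    """Hypothetical stand-in scorer: fraction of prompt words found in a caption.

    This is NOT VQAScore; it only makes the sketch runnable without a model.
    """
    prompt_words = set(prompt.lower().split())
    caption_words = set(caption.lower().split())
    return len(prompt_words & caption_words) / len(prompt_words)


# Captions stand in for generated images in this toy example.
demo_prompt = "a red cube on a blue sphere"
demo_candidates = ["a blue cube", "a red cube on a blue sphere", "a sphere"]
best = pick_best(demo_prompt, demo_candidates, toy_score)
```

Because the generator is never modified, the same selection loop applies to any model that can produce several samples for one prompt; only the scoring function changes.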

GenAI-Bench: Evaluating and Improving Text-to-Visual Generation