Text-to-Image (T2I) generation models have advanced significantly in recent years, and many automated metrics have emerged to evaluate how well generated images align with their text prompts. However, comparisons among these metrics are limited by the small size of existing datasets, which also cannot assess metric performance at a fine-grained level. In this study, we contribute EvalMuse-40K, a benchmark of 40K image-text pairs with fine-grained human annotations for image-text alignment tasks. During construction, we employ strategies such as balanced prompt sampling and data re-annotation to ensure the diversity and reliability of the benchmark, which allows us to comprehensively evaluate the effectiveness of image-text alignment metrics for T2I models. We also introduce two new methods for evaluating the image-text alignment of T2I models: FGA-BLIP2, which fine-tunes a vision-language model end to end to produce fine-grained image-text alignment scores, and PN-VQA, which applies a novel positive-negative questioning scheme to VQA models for zero-shot fine-grained evaluation. Both methods achieve strong performance in image-text alignment evaluation. Finally, we use our methods to rank current AIGC models; the results can serve as a reference for future work and promote the development of T2I generation. The data and code will be made publicly available.
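As an illustration of what comparing an automated metric against human annotations can look like, the sketch below computes a Spearman rank correlation between a metric's scores and human alignment scores for the same image-text pairs. This is an assumption for illustration only: the abstract does not state which agreement statistic the benchmark uses, and the function names and sample scores are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(metric_scores, human_scores):
    """Spearman correlation is the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(metric_scores), average_ranks(human_scores))


# Hypothetical scores for four image-text pairs (not taken from the benchmark).
metric_scores = [0.10, 0.40, 0.35, 0.80]
human_scores = [1, 3, 2, 5]
rho = spearman(metric_scores, human_scores)
```

A value of `rho` near 1 would indicate that the metric orders the pairs much as the human annotators do; a value near 0 would indicate little agreement.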

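This study addresses the inadequacy of existing small datasets for automatically evaluating text-to-image generation models, particularly at a fine-grained level. We propose the EvalMuse-40K benchmark, which collects 40K image-text pairs with fine-grained human annotations and supports diverse evaluation, and we introduce two new evaluation methods that substantially improve the assessment of image-text alignment. The work offers an important reference for future research on generative models and advances text-to-image generation.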

EvalMuse-40K: A Reliable and Fine-Grained Benchmark with Comprehensive Human Annotations for Text-to-Image Generation Model Evaluation