Evaluating the quality of automatically generated image descriptions is challenging: a good metric must capture grammaticality, coverage, correctness, and truthfulness. Human evaluation offers valuable insights, but it is costly and time-consuming. Existing automated metrics such as BLEU, ROUGE, METEOR, and CIDEr aim to bridge this gap but often correlate weakly with human judgment. We address this challenge by introducing a novel evaluation framework built on a modern large language model (LLM) capable of image generation, such as GPT-4 or Gemini. In the proposed framework, an input image is first fed to the image captioning model under evaluation, which generates a textual description. An LLM then creates a new image from this description. We extract features from both the original and the LLM-created image and measure their similarity with a designated similarity metric. A high similarity score suggests that the captioning model has described the image accurately, while a low score indicates discrepancies and reveals potential shortcomings in the model's performance. The framework requires no human-annotated reference captions and serves as a valuable tool for evaluating the effectiveness of image captioning models. Its efficacy is confirmed through human evaluation.
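The evaluation loop described above (caption, regenerate, extract features, compare) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the feature extractor or the similarity metric, so cosine similarity is assumed here, and `caption_score`, `captioner`, `image_generator`, and `feature_extractor` are hypothetical names for components the caller supplies.

```python
import math
from typing import Any, Callable, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors (an assumed metric)."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero feature vector as "no similarity"
    return dot / (norm_a * norm_b)


def caption_score(
    image: Any,
    captioner: Callable[[Any], str],
    image_generator: Callable[[str], Any],
    feature_extractor: Callable[[Any], Sequence[float]],
    similarity: Callable[[Sequence[float], Sequence[float]], float] = cosine_similarity,
) -> float:
    """Score a captioning model on one image without reference captions.

    All callables are hypothetical placeholders: `captioner` is the model
    under evaluation, `image_generator` stands in for the image-generating
    LLM, and `feature_extractor` maps an image to a feature vector.
    """
    caption = captioner(image)             # step 1: image -> text
    regenerated = image_generator(caption)  # step 2: text -> new image
    original_features = feature_extractor(image)
    regenerated_features = feature_extractor(regenerated)
    # step 3: compare the original and regenerated images in feature space
    return similarity(original_features, regenerated_features)
```

In practice the three callables would wrap a real captioning model, an image-generating model, and a pretrained image encoder. Passing them as parameters keeps the scoring logic independent of any particular model or API.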

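This study addresses the challenge of evaluating the quality of automatically generated image descriptions, in particular the weak correlation between existing automated metrics and human judgment. We introduce a new evaluation framework based on a modern large language model (such as GPT-4 or Gemini): an image is regenerated from the generated description and compared for similarity with the original image, which gives an objective measure of how effective the image captioning model is. The method can evaluate the accuracy of image descriptions without human annotation, providing a new tool for related research.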

A New Evaluation Framework for Image-to-Text Generation