Evaluating image captions typically relies on reference captions, which are costly to obtain and exhibit significant diversity and subjectivity. While reference-free evaluation metrics have been proposed, most focus on cross-modal evaluation between captions and images. Recent research