The correlation between automatic NLG evaluation metrics and human evaluation is often regarded as a critical criterion for assessing the capability of an evaluation metric. However, different grouping methods and correlation coefficients give rise to various types of correlation measures used in meta-evaluation. In specific evaluation scenarios, prior work often directly follows conventional measure settings, but the characteristics of these measures and the differences between them have not received sufficient attention. Therefore, this paper analyzes 12 common correlation measures using a large amount of real-world data from six widely used NLG evaluation datasets and 32 evaluation metrics, revealing that the choice of measure does affect the meta-evaluation results. Furthermore, we propose three perspectives that reflect the capability of meta-evaluation: discriminative power, ranking consistency, and sensitivity to score granularity. Across these three perspectives, we find that the measure using global grouping and Pearson correlation exhibits the best overall performance.

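This study addresses the differences among measures of correlation between automatic natural language generation (NLG) evaluation metrics and human evaluation. By analyzing 12 common correlation measures, it finds that different measures affect meta-evaluation results and proposes three perspectives that reflect meta-evaluation capability. It ultimately finds that the measure combining global grouping with Pearson correlation performs best, showing good discriminative power and consistency.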
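To make the best-performing setting concrete, the following is a minimal sketch of a correlation measure that combines global grouping with Pearson correlation. It assumes that "global grouping" means pooling the scores of every (input, system) pair into a single sample before correlating, which the text above does not spell out. The function names `pearson` and `global_pearson` and the toy scores are hypothetical and serve only as illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def global_pearson(metric_scores, human_scores):
    """Assumed 'global grouping': flatten both score matrices
    (rows = inputs, columns = systems) and correlate all pairs at once."""
    flat_metric = [s for row in metric_scores for s in row]
    flat_human = [s for row in human_scores for s in row]
    return pearson(flat_metric, flat_human)


# Hypothetical toy data: 3 source inputs (rows) x 2 systems (columns).
metric = [[0.2, 0.4], [0.5, 0.7], [0.6, 0.9]]
human = [[1.0, 2.0], [2.0, 3.0], [3.0, 5.0]]
print(round(global_pearson(metric, human), 3))
```

Other measures would differ in how the score matrices are grouped before correlating, or in which correlation coefficient replaces `pearson`; those variants are not shown here.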

Analysis and Evaluation of Correlation Measures in Natural Language Generation Meta-Evaluation