N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely
utilized across a range of natural language generation (NLG) tasks. However,
recent studies have revealed a weak correlation between these matching-based
metrics and human evaluations, especially when compared with neural-based
metrics like BLEURT. In this paper, we conjecture that the performance
bottleneck in matching-based metrics may be caused by the limited diversity of
references. To address this issue, we propose to utilize \textit{multiple
references} to enhance the consistency between these metrics and human
evaluations. Within the WMT Metrics benchmarks, we observe that the
multi-references F200spBLEU surpasses the conventional single-reference one by
an accuracy improvement of 7.2\%. Remarkably, it also exceeds the neural-based
BERTscore by an accuracy enhancement of 3.9\%. Moreover, we observe that the
data leakage issue in large language models (LLMs) can be mitigated to a large
extent by our multi-reference metric. We release the code and data at
https://github.com/SefaZeng/LLM-Ref

N-gram 匹配评估指标，如 BLEU 和 chrF，在各种自然语言生成（NLG）任务中被广泛使用。然而，最近的研究发现，这些基于匹配的指标与人类评估之间存在较弱的相关性，尤其与 BLEURT 等基于神经网络的指标相比。在本文中，我们假设匹配指标的性能瓶颈可能是由于参考文献的多样性有限所致。为了解决这个问题，我们提出利用多个参考文献来增强这些指标与人类评估之间的一致性。在 WMT Metrics 基准测试中，我们观察到多参考文献的 F200spBLEU 比传统的单参考文献提高了 7.2％的准确度，而且它还超过了基于神经网络的 BERTscore 3.9％的准确度提升。此外，我们观察到大型语言模型（LLMs）中的数据泄漏问题在很大程度上可以通过我们的多参考文献指标得到缓解。我们在 https://github.com/SefaZeng/LLM-Ref 上发布了代码和数据。