N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely
utilized across a range of natural language generation (NLG) tasks. However,
recent studies have revealed a weak correlation between these matching-based
metrics and human evaluations, especially when compared with neural-based
metrics like BLEURT. In this paper, we conjecture that the performance
bottleneck in matching-based metrics may be caused by the limited diversity of
references. To address this issue, we propose to utilize \textit{multiple
references} to enhance the consistency between these metrics and human
evaluations. Within the WMT Metrics benchmarks, we observe that the
multi-references F200spBLEU surpasses the conventional single-reference one by
an accuracy improvement of 7.2\%. Remarkably, it also exceeds the neural-based
BERTscore by an accuracy enhancement of 3.9\%. Moreover, we observe that the
data leakage issue in large language models (LLMs) can be mitigated to a large
extent by our multi-reference metric. We release the code and data at
https://github.com/SefaZeng/LLM-Ref

N-gram 匹配评估指标，如 BLEU 和 chrF，在各种自然语言生成（NLG）任务中被广泛使用。然而，最近的研究发现，这些基于匹配的指标与人类评估之间存在较弱的相关性，尤其与 BLEURT 等基于神经网络的指标相比。在本文中，我们假设匹配指标的性能瓶颈可能是由于参考文献的多样性有限所致。为了解决这个问题，我们提出利用多个参考文献来增强这些指标与人类评估之间的一致性。在 WMT Metrics 基准测试中，我们观察到多参考文献的 F200spBLEU 比传统的单参考文献提高了 7.2％的准确度，而且它还超过了基于神经网络的 BERTscore 3.9％的准确度提升。此外，我们观察到大型语言模型（LLMs）中的数据泄漏问题在很大程度上可以通过我们的多参考文献指标得到缓解。我们在 https://github.com/SefaZeng/LLM-Ref 上发布了代码和数据。

走向多参考时代 -- 解决自然语言生成评估中的数据泄漏和参考多样性受限问题

Towards Multiple References Era -- Addressing Data Leakage and Limited  Reference Diversity in NLG Evaluation

Several neural-based metrics have been recently proposed to evaluate machine
translation quality. However, all of them resort to point estimates, which
provide limited information at segment level. This is made worse as they are
trained on noisy, biased and scarce human judgements, often resulting in
unreliable quality predictions. In this paper, we introduce uncertainty-aware
MT evaluation and analyze the trustworthiness of the predicted quality. We
combine the COMET framework with two uncertainty estimation methods, Monte
Carlo dropout and deep ensembles, to obtain quality scores along with
confidence intervals. We compare the performance of our uncertainty-aware MT
evaluation methods across multiple language pairs from the QT21 dataset and the
WMT20 metrics task, augmented with MQM annotations. We experiment with varying
numbers of references and further discuss the usefulness of uncertainty-aware
quality estimation (without references) to flag possibly critical translation
mistakes.

本研究介绍了一种基于神经网络度量的机器翻译质量不确定性评估方法，并结合蒙特卡罗 dropout 和深度集成等两种不确定度估计方法，得出质量分数以及置信区间。通过对来自 QT21 数据集和 WMT20 度量任务的多语种数据进行实验，验证了该方法的性能，进一步探讨了不依赖参考文献的不确定性评估在发现可能的翻译错误中的应用。