Most research on natural language generation (NLG) relies on evaluation benchmarks that provide only a limited number of references per sample, which may result in poor correlation with human judgements. The underlying reason is that a single meaning can be expressed in many different forms, so evaluation against one or a few references may not accurately reflect the quality of a model's hypotheses. To address this issue, this paper presents a novel method, named Para-Ref, that enhances existing evaluation benchmarks by increasing the number of references. We leverage large language models (LLMs) to paraphrase a single reference into multiple high-quality references with diverse expressions. Experimental results on three representative NLG tasks (machine translation, text summarization, and image captioning) demonstrate that our method effectively improves the correlation with human evaluation for sixteen automatic evaluation metrics, by a relative +7.82%. We release the code and data at https://github.com/RUCAIBox/Para-Ref.
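
To illustrate the idea of scoring a hypothesis against an enriched reference set, here is a minimal sketch. It is not the released Para-Ref implementation: the toy `unigram_f1` metric, the helper `multi_reference_score`, the max-over-references aggregation, and the hard-coded paraphrases (standing in for LLM output) are all assumptions made for illustration.

```python
from collections import Counter


def unigram_f1(hypothesis: str, reference: str) -> float:
    """Toy lexical-overlap metric (an illustrative stand-in for a real NLG metric)."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    if not hyp or not ref:
        return 0.0
    # Clipped unigram overlap between hypothesis and reference.
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def multi_reference_score(hypothesis, references, metric=unigram_f1):
    """Score against every reference and keep the best match.

    Taking the maximum is an assumed aggregation choice, not one stated in
    the abstract. Raises ValueError if `references` is empty.
    """
    return max(metric(hypothesis, ref) for ref in references)


# Hypothetical data: in practice the paraphrases would come from an LLM.
original_reference = "the cat sat on the mat"
paraphrases = [
    "a cat was sitting on the mat",
    "the cat rested on the rug",
]
hypothesis = "a cat was sitting on the rug"

single = multi_reference_score(hypothesis, [original_reference])
multi = multi_reference_score(hypothesis, [original_reference] + paraphrases)
print(f"single-reference score: {single:.3f}")
print(f"multi-reference score:  {multi:.3f}")
```

In this toy case the hypothesis is a reasonable rendering of the original reference but shares few words with it, so the single-reference score is low; adding paraphrased references lets the same metric recognize the match.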

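This paper proposes Para-Ref, a new method that enhances existing natural language generation evaluation benchmarks by using large language models to paraphrase references. Experiments on machine translation, text summarization, and image captioning show that, by supplying multiple high-quality references, the method improves the correlation between sixteen automatic evaluation metrics and human evaluation by 7.82%.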

Not All Metrics Are Guilty: Improving NLG Evaluation with LLM Paraphrasing