Large Language Models (LLMs) increasingly serve as evaluators in Natural Language Generation (NLG) tasks, yet their ability to score NLG quality remains inadequately explored. Current studies depend on human assessments and simple metrics that fail to capture the discernment of LLMs across diverse NLG tasks. To address this gap, we propose the Discernment of Hierarchical Perturbation (DHP) benchmarking framework, which uses hierarchically perturbed text data and statistical tests to produce quantitative discernment scores that systematically measure the NLG evaluation capabilities of LLMs. We have re-established six evaluation datasets for this benchmark, covering four NLG tasks: Summarization, Story Completion, Question Answering, and Translation. Our comprehensive benchmarking of five major LLM series provides critical insight into their strengths and limitations as NLG evaluators.
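The general idea of scoring perturbed text at several severity levels and testing whether an evaluator's scores drop can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not specify the perturbation operators, the statistical test, or how discernment scores are computed, so the placeholder-token perturbation, the `toy_evaluator` stand-in for an LLM judge, and the exact one-sided sign test below are hypothetical choices, not the DHP procedure itself.

```python
import math
import random


def perturb(text, level, rng):
    """Hypothetical perturbation: replace a fraction `level` (0.0 to 1.0) of words with a placeholder."""
    words = text.split()
    n_replace = round(level * len(words))
    for i in rng.sample(range(len(words)), n_replace):
        words[i] = "<unk>"
    return " ".join(words)


def toy_evaluator(original, candidate):
    """Stand-in for an LLM judge (assumption): fraction of word positions left unchanged."""
    orig_words = original.split()
    cand_words = candidate.split()
    same = sum(o == c for o, c in zip(orig_words, cand_words))
    return same / len(orig_words)


def sign_test_p(base_scores, perturbed_scores):
    """Exact one-sided sign test: p-value for 'perturbed texts score lower than originals'."""
    wins = sum(b > p for b, p in zip(base_scores, perturbed_scores))
    losses = sum(b < p for b, p in zip(base_scores, perturbed_scores))
    n = wins + losses  # ties are discarded
    if n == 0:
        return 1.0
    # P(X >= wins) for X ~ Binomial(n, 0.5)
    return sum(math.comb(n, k) for k in range(wins, n + 1)) / 2 ** n


def discernment_by_level(texts, levels, evaluator, seed=0):
    """Return a p-value per perturbation level; smaller values mean the evaluator noticed the damage."""
    rng = random.Random(seed)
    base = [evaluator(t, t) for t in texts]
    result = {}
    for level in levels:
        perturbed = [evaluator(t, perturb(t, level, rng)) for t in texts]
        result[level] = sign_test_p(base, perturbed)
    return result


if __name__ == "__main__":
    sample = ["the quick brown fox jumps over the lazy dog today"] * 8
    for level, p in discernment_by_level(sample, [0.0, 0.2, 0.5], toy_evaluator).items():
        print(f"level={level:.1f}  p-value={p:.4f}")
```

In a real setting, `toy_evaluator` would be replaced by a call to the LLM under test, and an evaluator with good discernment would be expected to yield small p-values even at mild perturbation levels.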

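This study addresses the lack of exploration of the capabilities of Large Language Models (LLMs) in existing Natural Language Generation (NLG) evaluation by proposing the "Discernment of Hierarchical Perturbation (DHP)" benchmarking framework. The framework uses hierarchically perturbed text data and statistical tests to provide quantitative evaluation scores for LLMs. The study finds that the evaluation capabilities of LLMs differ significantly across NLG tasks, offering important insight into the strengths and limitations of LLMs as NLG evaluators.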

DHP Benchmark: Are Large Language Models Good Natural Language Generation Evaluators?