Human evaluation is the foundation upon which the evaluation of both
summarization systems and automatic metrics rests. However, existing human
evaluation protocols and benchmarks for summarization either exhibit low
inter-annotator agreement or lack the scale needed to draw statistically
significant conclusions. Moreover, an in-depth analysis of human evaluation
itself is lacking. In this work, we address the shortcomings of existing summarization
evaluation along the following axes: 1) We propose a modified summarization
salience protocol, Atomic Content Units (ACUs), which relies on fine-grained
semantic units and allows for high inter-annotator agreement. 2) We curate the
Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation
dataset consisting of over 22k summary-level annotations over state-of-the-art
systems on three datasets. 3) We compare our ACU protocol with three other
human evaluation protocols, underscoring potential confounding factors in
evaluation setups. 4) We evaluate existing automatic metrics using the
collected human annotations across evaluation protocols and demonstrate how our
benchmark leads to more statistically stable and significant results.
Furthermore, our findings have important implications for evaluating large
language models (LLMs): we show that LLMs tuned with human feedback (e.g.,
GPT-3.5) may overfit to unconstrained human evaluation, which is affected by
annotators' prior, input-agnostic preferences. This calls for more robust,
targeted evaluation methods.

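Summary: This paper examines the shortcomings of existing human evaluation protocols and benchmarks for automatic summarization. It proposes a modified summarization salience protocol based on fine-grained semantic units (ACUs) and a large human evaluation dataset (RoSE), compares the ACU protocol with other human evaluation protocols, and shows that the new benchmark annotations yield more stable and statistically significant results when evaluating automatic metrics, with implications for the evaluation of large language models.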