How well can large language models (LLMs) generate summaries? We develop new
datasets and conduct human evaluation experiments to evaluate the zero-shot
generation capability of LLMs across five distinct summarization tasks. Our
findings indicate a clear preference among human evaluators for LLM-generated
summaries over human-written summaries and summaries generated by fine-tuned
models. Specifically, LLM-generated summaries exhibit better factual
consistency and fewer instances of extrinsic hallucinations. Due to the
satisfactory performance of LLMs in summarization tasks (even surpassing the
benchmark of reference summaries), we believe that most conventional work in
the field of text summarization is no longer necessary in the era of LLMs.
However, we recognize that there are still some directions worth exploring,
such as creating novel, higher-quality datasets and developing more reliable
evaluation methods.

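LLMs perform satisfactorily on summarization tasks, surpassing the benchmark of reference summaries. Human evaluators clearly prefer LLM-generated summaries over human-written summaries and summaries generated by fine-tuned models, because LLM-generated summaries show better factual consistency and fewer extrinsic hallucinations.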

Summarization is (Almost) Dead

This paper introduces the Human Evaluation Datasheet, a template for
recording the details of individual human evaluation experiments in Natural
Language Processing (NLP). Originally taking inspiration from seminal papers by
Bender and Friedman (2018), Mitchell et al. (2019), and Gebru et al. (2020),
the Human Evaluation Datasheet is intended to facilitate the recording of
properties of human evaluations in sufficient detail, and with sufficient
standardisation, to support comparability, meta-evaluation, and reproducibility
tests.

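This paper introduces the Human Evaluation Datasheet, a template for recording the details of individual human evaluation experiments in Natural Language Processing (NLP). The datasheet is intended to facilitate recording the properties of human evaluations in order to support comparability, meta-evaluation, and reproducibility tests.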