People from different social and demographic groups express diverse
perspectives and conflicting opinions on a broad set of topics such as product
reviews, healthcare, law, and politics. A fair summary should provide
comprehensive coverage of diverse perspectives without underrepresenting
certain groups. However, current work in summarization metrics and large
language model (LLM) evaluation has not explored fair abstractive
summarization. In this paper, we systematically investigate fair abstractive
summarization for user-generated data. We first formally define fairness in
abstractive summarization as not underrepresenting the perspectives of any group
of people and propose four reference-free automatic metrics measuring the
differences between target and source perspectives. We evaluate five LLMs,
including three GPT models, Alpaca, and Claude, on six datasets collected from
social media, online reviews, and recorded transcripts. Experiments show that
both the model-generated and the human-written reference summaries suffer from
low fairness. We conduct a comprehensive analysis of the common factors
influencing fairness and propose three simple but effective methods to
alleviate unfair summarization. Our dataset and code are available at
this https URL
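
As a rough illustration of the definition above (fairness as not underrepresenting any group's perspectives), the sketch below compares how often each group appears in the source with how often it appears in the summary. It is not one of the paper's four metrics, which the abstract does not specify. The function names, the `tolerance` threshold, and the assumption that every source document and summary unit already carries a group label are hypothetical choices made only for this example.

```python
from collections import Counter


def group_distribution(labels):
    """Return the share of each group label in a list of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


def representation_gap(source_labels, summary_labels):
    """Summary share minus source share for every group seen in the source.

    A negative value means the group takes up less of the summary than
    of the source (assumed here to signal underrepresentation).
    """
    source = group_distribution(source_labels)
    summary = group_distribution(summary_labels)
    return {g: summary.get(g, 0.0) - share for g, share in source.items()}


def underrepresented(gaps, tolerance=0.05):
    """List groups whose share dropped by more than `tolerance` (hypothetical threshold)."""
    return sorted(g for g, gap in gaps.items() if gap < -tolerance)


# Toy data: 6 positive and 4 negative source reviews; the summary is
# assumed to contain 4 positive-attributed units and 1 negative one.
source_labels = ["positive"] * 6 + ["negative"] * 4
summary_labels = ["positive"] * 4 + ["negative"] * 1

gaps = representation_gap(source_labels, summary_labels)
print(gaps)
print(underrepresented(gaps))
```

In this toy case the negative group makes up 40% of the source but only 20% of the summary, so it is flagged. A reference-free metric in the paper's sense would also need a way to attribute summary content to groups, which this sketch takes as given.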

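We conduct a systematic study of fair abstractive summarization for user-generated data, formally define fairness in abstractive summarization for the first time, and propose four reference-free automatic metrics that measure the differences between target and source perspectives. Experiments show that both model-generated and human-written reference summaries suffer from low fairness, and we propose three simple but effective methods to alleviate unfair summarization.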

Fair Abstractive Summarization of Diverse Perspectives

Large language models (LLMs) are gaining increasing popularity in both
academia and industry, owing to their unprecedented performance in various
applications. As LLMs continue to play a vital role in both research and daily
use, their evaluation becomes increasingly critical, not only at the task
level, but also at the societal level for better understanding of their
potential risks. Over the past years, significant efforts have been made to
examine LLMs from various perspectives. This paper presents a comprehensive
review of these evaluation methods for LLMs, focusing on three key dimensions:
what to evaluate, where to evaluate, and how to evaluate. Firstly, we provide
an overview from the perspective of evaluation tasks, encompassing general
natural language processing tasks, reasoning, medical usage, ethics,
education, natural and social sciences, agent applications, and other areas.
Secondly, we answer the 'where' and 'how' questions by diving into the
evaluation methods and benchmarks, which serve as crucial components in
assessing the performance of LLMs. Then, we summarize the success and failure cases
of LLMs in different tasks. Finally, we shed light on several future challenges
that lie ahead in LLM evaluation. Our aim is to offer invaluable insights to
researchers in the realm of LLM evaluation, thereby aiding the development of
more proficient LLMs. Our key point is that evaluation should be treated as an
essential discipline to better assist the development of LLMs. We consistently
maintain the related open-source materials at:
this https URL

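Evaluation methods are an important part of research on large language models (LLMs). This survey presents the methods and dimensions for evaluating LLMs and summarizes the success cases, failure cases, and future challenges of LLMs across different tasks.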