In this position paper, we argue that human evaluation of generative large
language models (LLMs) should be a multidisciplinary undertaking that draws
upon insights from disciplines such as user experience research and human
behavioral psychology to ensure that the experimental design and results are
reliable. The conclusions from these evaluations must therefore account for factors
such as usability, aesthetics, and cognitive biases. We highlight how cognitive
biases can lead evaluators to conflate fluency with truthfulness, and how cognitive
uncertainty affects the reliability of rating scores such as those on Likert scales.
Furthermore, the evaluation should differentiate the capabilities and
weaknesses of increasingly powerful large language models, which requires
effective test sets. The scalability of human evaluation is also crucial to
wider adoption. Hence, to design an effective human evaluation system in the
age of generative NLP, we propose the ConSiDERS-The-Human evaluation framework,
consisting of six pillars: Consistency, Scoring Criteria, Differentiating, User
Experience, Responsible, and Scalability.
