This research investigates prompt designs for evaluating generated texts using
large language models (LLMs). While LLMs are increasingly used for scoring
various inputs, creating effective prompts for open-ended text evaluation
remains challenging because of model sensitivity and the subjectivity inherent
in evaluating generated text. Our study experimented with different prompt
structures, altering the order of the output instructions and including
explanatory reasons.
We found that the order in which reasons and scores are presented significantly
influences LLMs' scoring, and that the effect varies with the level of rule
understanding in the prompt. An additional optimization step may improve
scoring alignment when sufficient data is available. This insight is crucial
for improving the accuracy and consistency of LLM-based evaluations.
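
The output-ordering idea described above can be illustrated with a minimal sketch. It assumes a JSON reply format and uses a hypothetical rubric, field names, and helper functions (`build_prompt`, `parse_reply`) that do not come from the paper; only the contrast between asking for the score first and asking for the reasons first is taken from the abstract.

```python
import json
from typing import Tuple

# Hypothetical rubric, used only for illustration.
RUBRIC = "Rate the response from 1 (poor) to 10 (excellent) for helpfulness."


def build_prompt(text: str, order: str) -> str:
    """Return an evaluation prompt whose output instructions follow `order`.

    order="score_first":  the model is asked for the score, then the reasons.
    order="reason_first": the model is asked for the reasons, then the score.
    """
    if order == "score_first":
        fields = ["score", "reasons"]
    elif order == "reason_first":
        fields = ["reasons", "score"]
    else:
        raise ValueError(f"unknown order: {order!r}")
    # Spell out the requested key order so the two variants differ only here.
    schema = ", ".join(f'"{name}": ...' for name in fields)
    return (
        f"{RUBRIC}\n\n"
        f"Text to evaluate:\n{text}\n\n"
        f"Reply with a JSON object whose keys appear in this order: {{{schema}}}"
    )


def parse_reply(reply: str) -> Tuple[int, str]:
    """Extract (score, reasons) from a JSON reply, regardless of key order."""
    data = json.loads(reply)
    return int(data["score"]), str(data["reasons"])
```

Sending both variants for the same text to the same model and comparing the parsed scores is one way to observe the ordering effect; the model call itself is left out here because it depends on the provider.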

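By studying prompt designs for evaluating generated text with large language models, this study finds that different prompt structures and the order in which explanatory reasons are included significantly affect LLM scoring, and it proposes a method for improving scoring consistency.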

A Better LLM Evaluator for Text Generation: The Impact of Prompt Output Sequencing and Optimization

Chatbots have been an interesting application of natural language generation
since its inception. With novel transformer-based generative AI methods,
building chatbots has become trivial. Chatbots targeted at specific domains
such as medicine, psychology, and general information retrieval are being
implemented rapidly. This, however, should not distract from the need to
evaluate chatbot responses, especially because the natural language generation
community does not entirely agree on how to evaluate such applications
effectively. In this work, we discuss the issue further in light of the
increasingly popular LLM-based evaluations and how they correlate with human
evaluations. Additionally, we introduce a comprehensive factored evaluation
mechanism that can be used in conjunction with both human and LLM-based
evaluations.
We present the results of an experimental evaluation conducted using this
scheme on one of our chatbot implementations, and subsequently compare
automated evaluation, traditional human evaluation, factored human evaluation,
and factored LLM evaluation. Results show that factor-based evaluation produces
better insights into which aspects of LLM applications need to be improved, and
further strengthens the argument for using human evaluation in critical spaces
where the main functionality is not direct retrieval.
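
As a rough illustration of how per-factor agreement between human and LLM evaluations might be measured, the sketch below computes a Spearman rank correlation for each factor. The choice of Spearman correlation, the helper names, and the input layout (a dictionary mapping a factor name to a list of scores) are assumptions made for this example, not details taken from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def per_factor_agreement(human, llm):
    """Map each factor name to the human-vs-LLM Spearman correlation."""
    return {factor: spearman(human[factor], llm[factor]) for factor in human}
```

Reporting one correlation per factor, rather than a single overall number, reflects the abstract's point that factored evaluation shows which aspects need improvement.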

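Chatbot evaluation is an important problem. This study introduces a comprehensive evaluation mechanism that combines human evaluation with LLM-based evaluation, and shows experimentally that factor-based evaluation provides better insight into LLM applications, further strengthening the argument for using human evaluation in critical spaces where the main functionality is not direct retrieval.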