Workshop courses designed to foster creativity are gaining popularity.
However, achieving a holistic evaluation that accommodates diverse perspectives
is challenging, even for experienced faculty teams. Adequate discussion is
essential to integrate varied assessments, but faculty often lack the time for
such deliberations. Deriving an average score without discussion undermines the
purpose of a holistic evaluation. This paper explores the use of a Large
Language Model (LLM) as a facilitator to integrate diverse faculty assessments.
Scenario-based experiments were conducted to determine whether the LLM could
synthesize diverse evaluations and explain the underlying theories to faculty.
The results showed that the LLM effectively facilitated faculty discussions.
In addition, the LLM was able to generalize from a single scenario and create
evaluation criteria based on its learned domain knowledge.

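This paper explores using a Large Language Model (LLM) as a tool for facilitating the integration of diverse assessments. Experiments show that the LLM effectively facilitated faculty discussions and was able to generalize and create evaluation criteria from a single scenario.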

Facilitating Holistic Evaluations with LLMs: Insights from Scenario-Based Experiments

Retriever-augmented instruction-following models are attractive alternatives
to fine-tuned approaches for information-seeking tasks such as question
answering (QA). By simply prepending retrieved documents to their input along
with an instruction, these models can be adapted to various information domains
and tasks without additional fine-tuning. While the model responses tend to be
natural and fluent, the additional verbosity makes traditional QA evaluation
metrics such as exact match (EM) and F1 unreliable for accurately quantifying
model performance.
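
To make the point about verbosity concrete, the following is a minimal sketch of how exact match and token-level F1 behave on a terse versus a verbose answer. The normalization rules (lowercasing, stripping punctuation and English articles), the function names, and the example strings are assumptions chosen for illustration; they are not taken from the paper's code. The recall value is shown only as one possible token-overlap quantity, not as the specific metric the authors propose.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, split into tokens.

    These normalization rules are an assumption for this sketch.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction, reference):
    """Return 1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def overlap_scores(prediction, reference):
    """Return (precision, recall, F1) over the multiset of shared tokens."""
    pred, ref = normalize(prediction), normalize(reference)
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0, 0.0, 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: both answers contain the correct entity.
reference = "Paris"
terse = "Paris"
verbose = "The capital of France is Paris, which lies on the Seine."

for name, answer in [("terse", terse), ("verbose", verbose)]:
    em = exact_match(answer, reference)
    precision, recall, f1 = overlap_scores(answer, reference)
    print(f"{name}: EM={em:.2f} P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

In this toy case the verbose answer is correct, yet it scores zero on exact match and low on F1 because the extra tokens reduce precision, while recall against the reference is unaffected.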
In this work, we investigate the performance of instruction-following models
across three information-seeking QA tasks. We use both automatic and human
evaluation to assess these models along two dimensions: 1) how well they
satisfy the user's information need (correctness), and 2) whether they produce
a response based on the provided knowledge (faithfulness). Guided by human
evaluation and analysis, we highlight the shortcomings of traditional metrics
for both correctness and faithfulness. We then propose simple
token-overlap-based and model-based metrics that reflect the true performance
of these models. Our analysis reveals that instruction-following models are
competitive with, and sometimes even outperform, fine-tuned models in
correctness. However, these
models struggle to stick to the provided knowledge and often hallucinate in
their responses. We hope our work encourages a more holistic evaluation of
instruction-following models for QA. Our code and data are available at
this https URL

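This study examines the performance of retrieval-augmented instruction-following models on information-seeking question answering tasks, analyzes the shortcomings of traditional metrics, and proposes simple token-overlap-based and model-based metrics that reflect the true performance of these models. It finds that instruction-following models are competitive in correctness, sometimes even outperforming fine-tuned models, but struggle to stay faithful to the provided knowledge and often hallucinate in their responses.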