The standard methodology of evaluating large language models (LLMs) based on
static pairs of inputs and outputs is insufficient for developing assistants:
this kind of assessment fails to take into account the essential interactive
element in their deployment, and therefore limits how we understand language
model capabilities. We introduce CheckMate, an adaptable prototype platform for
humans to interact with and evaluate LLMs. We conduct a study with CheckMate to
evaluate three language models (InstructGPT, ChatGPT, and GPT-4) as assistants
in proving undergraduate-level mathematics, with a mixed cohort of participants
from undergraduate students to professors of mathematics. We release the
resulting interaction and rating dataset, MathConverse. By analysing
MathConverse, we derive a preliminary taxonomy of human behaviours and uncover
that despite a generally positive correlation, there are notable instances of
divergence between correctness and perceived helpfulness in LLM generations,
amongst other findings. Further, we identify useful scenarios and existing
issues of GPT-4 in mathematical reasoning through a series of case studies
contributed by expert mathematicians. We conclude with actionable takeaways for
ML practitioners and mathematicians: models that communicate uncertainty,
respond well to user corrections, and are more interpretable and concise may
constitute better assistants; interactive evaluation is a promising way to
continually navigate the capability of these models; humans should be aware of
language models' algebraic fallibility, and for that reason discern where they
should be used.

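This paper uses interactive evaluation to assess the ability of large language models in university-level mathematical reasoning and offers actionable advice for AI practitioners and mathematics professors, with a focus on how models should handle uncertainty and human corrections.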

Evaluating Mathematical Language Models through Interaction

Evaluating Language Models for Mathematics through Interactions

The ultimate goal of dialog research is to develop systems that can be
effectively used in interactive settings by real users. To this end, we
introduced the Interactive Evaluation of Dialog Track at the 9th Dialog System
Technology Challenge. This track consisted of two sub-tasks. The first sub-task
involved building knowledge-grounded response generation models. The second
sub-task aimed to extend dialog models beyond static datasets by assessing them
in an interactive setting with real users. Our track challenges participants to
develop strong response generation models and explore strategies that extend
them to back-and-forth interactions with real users. The progression from
static corpora to interactive evaluation introduces unique challenges and
facilitates a more thorough assessment of open-domain dialog systems. This
paper provides an overview of the track, including the methodology and results.
Furthermore, it provides insights into how to best evaluate open-domain dialog
models.

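This paper presents an interactive evaluation approach for open-domain dialog systems. It challenges participants to build knowledge-grounded response generation models and to extend them to interactions with real users, describes the progression from static corpora to interactive evaluation, and offers insights into how best to evaluate open-domain dialog models.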

The Interactive Evaluation of Dialog Track at DSTC9

Interactive Evaluation of Dialog Track at DSTC9

As conversational AI-based dialogue management has increasingly become a
trending topic, the need for a standardized and reliable evaluation procedure
grows even more pressing. A variety of evaluation protocols are currently used
to assess chat-oriented dialogue management systems, which makes it difficult to
conduct fair comparative studies across different approaches and to gain
insight into their value. To foster this
research, a more robust evaluation protocol must be set in place. This paper
presents a comprehensive synthesis of both automated and human evaluation
methods on dialogue systems, identifying their shortcomings while accumulating
evidence towards the most effective evaluation dimensions. A total of 20 papers
from the last two years are surveyed to analyze three types of evaluation
protocols: automated, static, and interactive. Finally, the evaluation
dimensions used in these papers are compared against our expert evaluation on
the system-user dialogue data collected from the Alexa Prize 2020.

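To address the lack of uniformity in evaluation protocols for dialogue systems, this paper synthesizes human and automated evaluation methods, argues for a more robust and standardized evaluation protocol, and analyzes the automated, static, and interactive evaluation methods currently in use. Finally, it compares these against system-user dialogue data from the Alexa Prize 2020 to identify the most effective evaluation dimensions.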