Language models have steadily increased in size over the past few years. They
achieve a high level of performance on various natural language processing
(NLP) tasks such as question answering and summarization. Large language models
(LLMs) have been used for generation and can now output human-like text. Due to
this, there are other downstream tasks in the realm of dialog that can now
harness the LLMs' language understanding capabilities. Dialog evaluation is one
task that this paper will explore. It concentrates on prompting with LLMs:
BLOOM, OPT, GPT-3, Flan-T5, InstructDial and TNLGv2. The paper shows that the
choice of datasets used for training a model affects both how well it performs
on a task and how the prompt should be structured. Specifically, the more
diverse and relevant the group of datasets that a model is trained on, the
better the model performs at dialog evaluation. This paper also
investigates how the number of examples in the prompt and the type of example
selection used affect the model's performance.
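
The two prompt variables named above, the number of in-context examples and
the way they are selected, can be illustrated with a small sketch. It is not
the paper's prompt format: the `Context`/`Response`/`Score` template, the
function names, the example dictionaries and the word-overlap selection rule
are all assumptions made for illustration.

```python
import random


def select_examples(pool, query, k, strategy="random", seed=0):
    """Pick k in-context examples from a pool of rated responses.

    Hypothetical helper: "random" samples uniformly, "overlap" prefers
    examples whose response shares the most words with the query response.
    """
    if strategy == "random":
        rng = random.Random(seed)
        return rng.sample(pool, min(k, len(pool)))
    if strategy == "overlap":
        query_words = set(query.lower().split())

        def jaccard(example):
            words = set(example["response"].lower().split())
            union = query_words | words
            return len(query_words & words) / len(union) if union else 0.0

        # sorted() is stable, so ties keep their original pool order.
        return sorted(pool, key=jaccard, reverse=True)[:k]
    raise ValueError(f"unknown strategy: {strategy}")


def build_prompt(examples, context, response):
    """Format the selected examples, then the item to be scored."""
    parts = [
        f"Context: {ex['context']}\nResponse: {ex['response']}\nScore: {ex['score']}\n"
        for ex in examples
    ]
    # The final block leaves the score empty for the model to complete.
    parts.append(f"Context: {context}\nResponse: {response}\nScore:")
    return "\n".join(parts)
```

Varying `k` and `strategy` while holding the rest of the prompt fixed is one
simple way to set up the kind of comparison the abstract describes.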

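This paper explores the use of large language models for dialog evaluation. It
finds that the diversity and relevance of the datasets used to train a model
are key factors in its performance, and it examines how the number of examples
in the prompt and the type of example selection affect model performance.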

Understanding the Effectiveness of Very Large Language Models on Dialog Evaluation

Automatic evaluation metrics are a crucial component of dialog systems
research. Standard language evaluation metrics are known to be ineffective for
evaluating dialog. As such, recent research has proposed a number of novel,
dialog-specific metrics that correlate better with human judgements. Due to the
fast pace of research, many of these metrics have been assessed on different
datasets and there has as yet been no time for a systematic comparison between
them. To this end, this paper provides a comprehensive assessment of recently
proposed dialog evaluation metrics on a number of datasets. In this paper, 23
different automatic evaluation metrics are evaluated on 10 different datasets.
Furthermore, the metrics are assessed in different settings, to better qualify
their respective strengths and weaknesses. Metrics are assessed (1) on both the
turn level and the dialog level, (2) for different dialog lengths, (3) for
different dialog qualities (e.g., coherence, engagingness), (4) for different types
of response generation models (i.e., generative, retrieval, simple models and
state-of-the-art models), (5) taking into account the similarity of different
metrics and (6) exploring combinations of different metrics. This comprehensive
assessment offers several takeaways pertaining to dialog evaluation metrics in
general. It also suggests how to best assess evaluation metrics and indicates
promising directions for future work.
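
Assessing a metric against human judgements usually comes down to a
correlation coefficient between the metric's scores and the human ratings of
the same items. The sketch below computes a Spearman rank correlation using
only the standard library. Choosing Spearman here is an assumption, since the
abstract does not name a coefficient, and the function names are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(metric_scores, human_ratings):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(metric_scores), average_ranks(human_ratings))
```

The same function applies at the turn level or the dialog level; only the
unit that each score and rating refers to changes.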

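This paper evaluates 23 automatic evaluation metrics on 10 datasets and
assesses them in different settings to better identify their respective
strengths and weaknesses. The comprehensive assessment yields several takeaways
about dialog evaluation metrics and offers useful guidance for future research.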

A Comprehensive Assessment of Dialog Evaluation Metrics

Dialog evaluation is a challenging problem, especially for non-task-oriented
dialogs where conversational success is not well-defined. We propose to
evaluate dialog quality using topic-based metrics that describe the ability of
a conversational bot to sustain coherent and engaging conversations on a topic,
and the diversity of topics that a bot can handle. To detect conversation
topics per utterance, we adopt Deep Average Networks (DAN) and train a topic
classifier on a variety of question and query data categorized into multiple
topics. We propose a novel extension to DAN by adding a topic-word attention
table that allows the system to jointly capture topic keywords in an utterance
and perform topic classification. We compare our proposed topic-based metrics
with the ratings provided by users and show that our metrics both correlate
with and complement human judgment. Our analysis is performed on tens of
thousands of real human-bot dialogs from the Alexa Prize competition and
highlights user expectations for conversational bots.
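
The idea of a topic-word attention table can be sketched as follows. This is
not the authors' architecture, whose details the abstract does not give. It is
a toy forward pass that assumes one attention weight per (topic, word) pair,
attention-weighted averaging of word embeddings for each topic, and a single
linear scoring vector per topic. All names and data structures are
hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def classify_topic(tokens, embeddings, attention, topic_vectors):
    """Return (topic probabilities, top keyword per topic) for one utterance.

    embeddings:    word -> embedding vector
    attention:     topic -> {word: attention weight} (the assumed table)
    topic_vectors: topic -> scoring vector (assumed linear output layer)
    """
    known = [t for t in tokens if t in embeddings]
    if not known:
        raise ValueError("no token has an embedding")
    scores, keywords = {}, {}
    for topic, table in attention.items():
        # Words missing from a topic's table get a neutral weight of 0.
        weights = softmax([table.get(t, 0.0) for t in known])
        dim = len(topic_vectors[topic])
        # Attention-weighted average replaces the plain average of a DAN.
        pooled = [
            sum(w * embeddings[t][d] for w, t in zip(weights, known))
            for d in range(dim)
        ]
        scores[topic] = sum(p * v for p, v in zip(pooled, topic_vectors[topic]))
        # The most attended word serves as the topic keyword.
        keywords[topic] = known[weights.index(max(weights))]
    topics = list(scores)
    probs = softmax([scores[t] for t in topics])
    return dict(zip(topics, probs)), keywords
```

Because the same weights drive both the pooled vector and the keyword choice,
classification and keyword detection come out of one pass, which is the joint
behavior the abstract attributes to the attention table.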

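The paper evaluates dialog quality with topic-based metrics that capture a
conversational bot's ability to sustain coherent and engaging conversations on
a topic, as well as the diversity of topics it can handle. It uses Deep Average
Networks and a topic classifier to detect the topic of each utterance, and adds
a topic-word attention table to capture topic keywords in an utterance while
performing topic classification. Compared against user-provided ratings, the
metrics both correlate with and complement human judgment. An analysis of tens
of thousands of real human-bot dialogs from the Alexa Prize competition
highlights user expectations for conversational bots.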