Despite significant research effort in the development of automatic dialogue
evaluation metrics, little thought has been given to evaluating dialogues in
languages other than English. At the same time, ensuring that metrics are
invariant to semantically similar responses is also an overlooked topic. To
achieve the desired properties of robustness and multilinguality for dialogue
evaluation metrics, we propose a novel framework that combines the strengths of
current evaluation models with the newly established paradigm of prompting
Large Language Models (LLMs). Empirical results show that our framework
achieves state-of-the-art results in terms of mean Spearman correlation scores
across several benchmarks and ranks first on both the Robust and Multilingual
tasks of the DSTC11 Track 4 "Automatic Evaluation Metrics for Open-Domain
Dialogue Systems", demonstrating the evaluation capabilities of prompted LLMs.
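
The headline figure in this abstract is a mean Spearman correlation across benchmarks. As a minimal sketch of how such a figure could be computed, the following ranks metric scores and human ratings per benchmark, correlates the ranks, and averages the results. The function names, the benchmark names, and all scores are hypothetical illustrations, not data or code from the paper, and the sketch assumes each score list has at least two distinct values.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(metric_scores, human_scores):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(average_ranks(metric_scores), average_ranks(human_scores))


def mean_spearman(benchmarks):
    """Average the per-benchmark Spearman correlations."""
    return mean(spearman(m, h) for m, h in benchmarks.values())


# Hypothetical toy data: (metric scores, human ratings) per benchmark.
toy_benchmarks = {
    "benchmark_a": ([0.1, 0.4, 0.35, 0.8], [1, 3, 2, 5]),
    "benchmark_b": ([0.9, 0.2, 0.5, 0.7], [4, 2, 1, 5]),
}

print(round(mean_spearman(toy_benchmarks), 3))
```

Averaging per-benchmark correlations, rather than pooling all examples into one list, keeps a benchmark with many examples from dominating the result; whether the paper weights benchmarks this way is an assumption here.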

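The study proposes a novel framework that combines the strengths of current evaluation models with the newly established paradigm of prompting large language models in order to evaluate dialogues robustly and across languages. It achieves state-of-the-art results on several benchmarks and ranks first on both the Robust and Multilingual tasks of DSTC11 Track 4, "Automatic Evaluation Metrics for Open-Domain Dialogue Systems", demonstrating the evaluation capabilities of prompted LLMs.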

Simple LLM Prompting is State-of-the-Art for Robust and Multilingual Dialogue Evaluation

Chatbots are designed to carry out human-like conversations across different
domains, such as general chit-chat, knowledge exchange, and persona-grounded
conversations. To measure the quality of such conversational agents, a dialogue
evaluator is expected to conduct assessment across domains as well. However,
most of the state-of-the-art automatic dialogue evaluation metrics (ADMs) are
not designed for multi-domain evaluation. We are motivated to design a general
and robust framework, MDD-Eval, to address the problem. Specifically, we first
train a teacher evaluator on human-annotated data so that it acquires the skill
of telling good dialogue responses from bad ones in a particular domain, and
then adopt a self-training strategy to train a new evaluator on
teacher-annotated multi-domain data, which helps the new evaluator generalize
across multiple domains. MDD-Eval is extensively assessed on six dialogue
evaluation benchmarks. Empirical results show that the MDD-Eval framework
achieves strong performance, with an absolute improvement of 7% over the state-of-the-art
ADMs in terms of mean Spearman correlation scores across all the evaluation
benchmarks.
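
The teacher-then-student procedure described above can be sketched in a few lines. This is only an illustration of the general self-training pattern under stated assumptions, not the MDD-Eval implementation: the word-overlap teacher, the cutoff-based student, the 0.5 labelling threshold, and all example dialogues are hypothetical stand-ins for the trained evaluators and the multi-domain data of the paper.

```python
def toy_teacher(context, response):
    """Hypothetical teacher: share of response words that also occur in the context."""
    ctx_words = set(context.lower().split())
    resp_words = response.lower().split()
    if not resp_words:
        return 0.0
    return sum(w in ctx_words for w in resp_words) / len(resp_words)


def pseudo_label(teacher, pairs, threshold=0.5):
    """Attach the teacher's binary judgement (1 = good, 0 = bad) to each pair.

    The threshold value is an assumption made for this sketch.
    """
    return [(ctx, resp, 1 if teacher(ctx, resp) >= threshold else 0)
            for ctx, resp in pairs]


def shared_words(context, response):
    """Feature used by the toy student: number of distinct shared words."""
    return len(set(context.lower().split()) & set(response.lower().split()))


def fit_toy_student(labeled):
    """Hypothetical student: pick the shared-word cutoff that best matches the
    teacher's pseudo-labels. Assumes `labeled` is non-empty."""
    feats = [(shared_words(c, r), y) for c, r, y in labeled]
    candidates = sorted({f for f, _ in feats})
    best = max(candidates,
               key=lambda t: sum((f >= t) == bool(y) for f, y in feats))
    return lambda c, r: int(shared_words(c, r) >= best)


def self_train(teacher, unlabeled_pairs, fit_student):
    """Label unlabeled data with the teacher, then fit a student on those labels."""
    return fit_student(pseudo_label(teacher, unlabeled_pairs))


# Hypothetical unlabeled (context, response) pairs from several domains.
unlabeled_pairs = [
    ("do you like jazz music", "yes i like jazz music"),
    ("where is the train station", "the station is near the park"),
    ("what is your favourite movie", "bananas are yellow"),
    ("how was the hotel breakfast", "my cat sleeps all day"),
]

student = self_train(toy_teacher, unlabeled_pairs, fit_toy_student)
print(student("do you like tea", "i do like tea"))
print(student("do you like tea", "the sky is blue"))
```

In practice both evaluators would be trained neural models, and the student would see far more varied data than the teacher's single domain; the sketch only shows how the teacher's annotations, rather than human labels, become the student's training signal.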

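Proposes the MDD-Eval framework, which gains multi-domain evaluation ability by combining in-domain evaluation training with cross-domain self-training, and which improves the mean Spearman correlation score by 7% over existing automatic dialogue evaluation metrics across six evaluation benchmarks.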