This paper describes the systems submitted by team6 for ChatEval, the DSTC 11
Track 4 competition. We present three different approaches to predicting
turn-level qualities of chatbot responses based on large language models
(LLMs). We report improvement over the baseline using dynamic few-shot examples
from a vector store for the prompts for ChatGPT. We also analyze the
performance of the other two approaches and report needed improvements for
future work. We developed the three systems over just two weeks, showing the
potential of LLMs for this task. An ablation study conducted after the
challenge deadline shows that the new Llama 2 models are closing the
performance gap between ChatGPT and open-source LLMs. However, we find that the
Llama 2 models do not benefit from few-shot examples in the same way as
ChatGPT.

本文通过三种不同的方法，基于大型语言模型（LLMs）对于 ChatGPT 响应的逐轮质量进行预测，并使用动态少量样本来改善基准，并分析了其他两种方法的性能并提出未来研究的改进。研究表明，Llama 2 模型正在缩小 ChatGPT 和开源 LLMs 之间的性能差距，但发现 Llama 2 模型不能像 ChatGPT 那样从少量样本中受益。