The advent of Large Language Models (LLMs) has drastically enhanced dialogue
systems. However, comprehensively evaluating the dialogue abilities of LLMs
remains a challenge. Previous benchmarks have primarily focused on single-turn
dialogues or provided coarse-grained and incomplete assessments of multi-turn
dialogues, overlooking the complexity and fine-grained nuances of real-life
dialogues. To address this issue, we introduce MT-Bench-101, specifically
designed to evaluate the fine-grained abilities of LLMs in multi-turn
dialogues. By conducting a detailed analysis of real multi-turn dialogue data,
we construct a three-tier hierarchical ability taxonomy; the benchmark comprises
4208 turns across 1388 multi-turn dialogues in 13 distinct tasks. We then evaluate 21
popular LLMs based on MT-Bench-101, conducting comprehensive analyses from both
ability and task perspectives and observing differing trends in LLM
performance across dialogue turns within various tasks. Further analysis
indicates that neither common alignment techniques nor chat-specific
designs have led to obvious improvements in the multi-turn abilities of LLMs.
Extensive case studies suggest that our designed tasks accurately assess the
corresponding multi-turn abilities.

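Through a detailed analysis of real multi-turn dialogue data, we construct a
three-tier hierarchical ability taxonomy for multi-turn dialogue, covering 4208
turns across 1388 multi-turn dialogues, and evaluate 21 popular large language
models on the multi-task benchmark MT-Bench-101, examining both their abilities
and how their performance differs within dialogues. Further analysis shows that
neither common alignment techniques nor chat-specific designs have noticeably
improved the multi-turn dialogue abilities of large language models. Extensive
case studies show that our designed tasks accurately assess the corresponding
multi-turn dialogue abilities.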