To solve complex tasks, large language models (LLMs) often require multiple
rounds of interactions with the user, sometimes assisted by external tools.
However, current evaluation paradigms often focus solely on benchmark
performance with single-turn exchanges, neglecting the intricate interactions
among the user, LLMs, and external tools, creating a discrepancy between
benchmark evaluation and real-world use cases. We introduce MINT, a benchmark
that evaluates LLMs' ability to solve tasks with multi-turn interactions by (1) using
tools and (2) leveraging natural language feedback. To ensure reproducibility,
we provide an evaluation framework where LLMs can access tools by executing
Python code and receive natural language feedback from a user simulated by
GPT-4. We repurpose a diverse set of established datasets and tasks focusing on
reasoning, coding, and decision-making and carefully curate them into a compact
subset of instances for efficient evaluation. Our analysis of 20 open- and
closed-source LLMs offers intriguing findings. (1) LLMs generally benefit from
tool interactions and language feedback, with performance gains (absolute, same
below) of 1--8% per additional turn with tool use and 2--17% with natural
language feedback. (2) Better single-turn performance does not guarantee better
multi-turn performance. (3) Surprisingly, among the LLMs we evaluated,
supervised instruction-finetuning (SIFT) and reinforcement learning from human
feedback (RLHF) generally hurt multi-turn capabilities. We hope MINT can help
measure progress and incentivize research in improving LLMs' capabilities in
multi-turn interactions, especially for open-source communities, where
multi-turn human evaluation has been less accessible than it is for commercial
LLMs with larger user bases.

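Using tools and natural language feedback, the MINT benchmark evaluates the ability of large language models to solve tasks that involve multi-turn interactions. An analysis of 20 open- and closed-source language models finds that LLM performance improves with tool interaction and natural language feedback.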
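
The interaction pattern described above (a model that either runs Python code as a tool or proposes an answer, with natural language feedback when the answer is wrong) can be sketched roughly as follows. This is an illustrative sketch only, not the MINT implementation: `run_python`, `interact`, the `(kind, content)` action format, and the scripted stand-ins for the model and the feedback provider are all hypothetical names and assumptions introduced here. In particular, the feedback function is a fixed string rather than a GPT-4 call, and `exec` is used without any sandboxing.

```python
import contextlib
import io
from typing import Callable, List, Tuple

# Hypothetical action format: ("code", source) to call the tool,
# or ("answer", text) to propose a final answer.
Action = Tuple[str, str]


def run_python(code: str) -> str:
    """Run code as the 'tool' and return captured stdout or the error text.

    Assumption: plain exec() with no sandbox, acceptable only for a sketch.
    """
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


def interact(
    model: Callable[[List[str]], Action],
    feedback: Callable[[List[str], str], str],
    is_correct: Callable[[str], bool],
    task: str,
    max_turns: int = 5,
) -> Tuple[bool, int]:
    """Run up to max_turns turns; return (solved, number of turns used)."""
    history = [task]
    for turn in range(1, max_turns + 1):
        kind, content = model(history)
        history.append(f"Model ({kind}): {content}")
        if kind == "answer":
            if is_correct(content):
                return True, turn
            # Wrong answer: add natural language feedback to the history.
            history.append(feedback(history, content))
        else:
            # Tool call: execute the code and add the result as an observation.
            history.append(f"Observation: {run_python(content)}")
    return False, max_turns


def scripted_model(history: List[str]) -> Action:
    """Stand-in for an LLM: computes with the tool first, then answers."""
    last = history[-1]
    if last.startswith("Observation: "):
        return ("answer", last[len("Observation: "):])
    return ("code", "print(17 * 23)")


def scripted_feedback(history: List[str], proposed: str) -> str:
    """Stand-in for the simulated user (GPT-4 in the paper's setup)."""
    return "Feedback: that answer is not correct; try checking it with code."


solved, turns = interact(
    scripted_model, scripted_feedback, lambda a: a == "391", "What is 17 * 23?"
)
print(solved, turns)
```

Counting the turns used per task, as `interact` does here, is one way a success rate could be reported as a function of the turn budget; the actual metric definitions are those of the benchmark itself.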