Large language models (LLMs) have demonstrated impressive capabilities in mathematical problem solving, particularly in single-turn question answering formats. However, real-world scenarios often involve mathematical question answering that requires multi-turn or interactive information exchanges, and the performance of LLMs on these tasks remains underexplored. This paper introduces MathChat, a comprehensive benchmark specifically designed to evaluate LLMs across a broader spectrum of mathematical tasks. These tasks are structured to assess the models' abilities in multi-turn interactions and open-ended generation. We evaluate various state-of-the-art (SOTA) LLMs on the MathChat benchmark and observe that while these models excel in single-turn question answering, they significantly underperform in more complex scenarios that require sustained reasoning and dialogue understanding. To address these limitations of existing LLMs on multi-turn and open-ended tasks, we develop MathChatsync, a synthetic dialogue-based math dataset for LLM fine-tuning, focusing on improving models' interaction and instruction-following capabilities in conversations. Experimental results emphasize the need for training LLMs with diverse, conversational instruction-tuning datasets like MathChatsync. We believe this work outlines one promising direction for improving the multi-turn mathematical reasoning abilities of LLMs, thus pushing forward the development of LLMs that are more adept at interactive mathematical problem solving and real-world applications.

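This paper introduces MathChat, a benchmark specifically designed to evaluate large language models across a broader range of mathematical tasks, and observes that these models perform well on single-turn question answering but degrade significantly in complex scenarios that require sustained reasoning and dialogue understanding. The authors develop MathChatsync, a synthetic dialogue-based math dataset for improving models' conversational and instruction-following capabilities, and their experimental results underscore the need to train large language models on diverse conversational instruction-tuning datasets like MathChatsync. The authors believe this work points to a promising direction for improving the multi-turn mathematical reasoning abilities of large language models, advancing the development of models that are more adept at interactive mathematical problem solving and real-world applications.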

MathChat: Benchmarking Mathematical Reasoning and Instruction Following in Multi-Turn Interactions