We introduce Mathador-LM, a new benchmark for evaluating the mathematical reasoning of large language models (LLMs), combining ruleset interpretation, planning, and problem-solving. The benchmark is inspired by the Mathador game, where the objective is to reach a target number by applying basic arithmetic operations to a given set of base numbers, following a simple set of rules. We show that average performance across leading LLMs remains stable when benchmark instances are generated dynamically at a target difficulty level. Our benchmark therefore alleviates concerns about test-set leakage into training data, an issue that often undermines popular benchmarks. Additionally, we conduct a comprehensive evaluation of both open and closed-source state-of-the-art LLMs on Mathador-LM. Our findings reveal that contemporary models struggle with Mathador-LM, scoring significantly lower than average 5th graders. This stands in stark contrast to their strong performance on popular mathematical reasoning benchmarks.
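To make the task concrete, the following is a minimal sketch of a brute-force solver for a Mathador-style instance. It is illustrative only and not the benchmark's implementation: the function name `solve` is hypothetical, and the rules it encodes (each base number used at most once, the four basic operations, non-negative intermediate results, exact integer division) are assumptions about the ruleset rather than details stated above.

```python
from itertools import combinations


def solve(base_numbers, target):
    """Return one expression that evaluates to `target`, or None if none exists.

    Assumed rules (not taken from the benchmark itself): each base number is
    used at most once, only +, -, *, / are allowed, intermediate results stay
    non-negative, and division must be exact.
    """

    def search(items):
        # items: list of (value, expression) pairs that are still unused.
        for value, expr in items:
            if value == target:
                return expr
        for i, j in combinations(range(len(items)), 2):
            (a, ea), (b, eb) = items[i], items[j]
            if a < b:  # order the pair so that a - b is never negative
                a, ea, b, eb = b, eb, a, ea
            rest = [items[k] for k in range(len(items)) if k != i and k != j]
            results = [
                (a + b, f"({ea} + {eb})"),
                (a - b, f"({ea} - {eb})"),
                (a * b, f"({ea} * {eb})"),
            ]
            if b != 0 and a % b == 0:  # keep only exact divisions
                results.append((a // b, f"({ea} / {eb})"))
            # Replace the chosen pair by each possible result and recurse.
            for combined in results:
                found = search(rest + [combined])
                if found is not None:
                    return found
        return None

    return search([(n, str(n)) for n in base_numbers])
```

For example, `solve([2, 3, 5], 13)` returns a parenthesized expression string that combines the base numbers into 13, while `solve([1, 1], 5)` returns `None` because no combination reaches the target. Exhaustive search is feasible here only because the set of base numbers is small.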

Mathador-LM: A Dynamic Benchmark for Mathematical Reasoning on Large Language Models