Large Language Models (LLMs) achieve impressive performance across a wide range of tasks, even though they are often trained with the sole objective of chatting fluently with users. Among other skills, LLMs show emergent abilities on mathematical reasoning benchmarks, which can be elicited with appropriate prompting methods. In this work, we systematically investigate the capabilities and limitations of popular open-source LLMs on different symbolic reasoning tasks. We evaluate three models of the Llama 2 family on two datasets that require solving mathematical formulas of varying difficulty. We test a generalist LLM (Llama 2 Chat) as well as two fine-tuned versions of Llama 2 (MAmmoTH and MetaMath) specifically designed to tackle mathematical problems. We observe that both increasing the scale of the model and fine-tuning it on relevant tasks lead to significant performance gains. Furthermore, using fine-grained evaluation measures, we find that these gains are mostly observed on mathematical formulas of low complexity, which nevertheless often remain challenging even for the largest fine-tuned models.


Assessing the Emergent Symbolic Reasoning Abilities of Llama Large Language Models