We evaluate the reasoning abilities of large language models in multilingual settings. We introduce the Multilingual Grade School Math (MGSM) benchmark, by manually translating 250 grade-school math problems from the GSM8K dataset (Cobbe et al., 2021) into ten typologically diverse languages. We find that the ability to solve MGSM problems via chain-of-thought prompting emerges with increasing model scale, and that models have strikingly strong multilingual reasoning abilities, even in underrepresented languages such as Bengali and Swahili. Finally, we show that the multilingual reasoning abilities of language models extend to other tasks such as commonsense reasoning and word-in-context semantic judgment. The MGSM benchmark is publicly available at https://github.com/google-research/url-nlp.

通过使用多种类型不同的语言，我们通过手动将 GSM8K 数据集中的 250 个小学数学问题翻译成十种不同的语言，评估了大型语言模型在多语种环境下的推理能力，并提出了 MGSM 基准。我们发现，随着模型规模的增加，使用思维链提示解决 MGSM 问题的能力越来越强，即使在孟加拉语和斯瓦希里语等少数语言中，这些模型也具有非常强的多语种推理能力。最后，我们展示了语言模型的多语种推理能力扩展到其他任务，例如常识推理和上下文语义判断。

语言模型是多语言的思维链推理器