Large Language Models (LLMs) have shown remarkable proficiency in language
understanding and have been successfully applied to a variety of real-world
tasks through task-specific fine-tuning or prompt engineering. Despite these
advancements, it remains an open question whether LLMs are fundamentally
capable of reasoning and planning, or if they primarily rely on recalling and
synthesizing information from their training data. In our research, we
introduce a novel task -- Minesweeper -- specifically designed in a format
unfamiliar to LLMs and absent from their training datasets. This task
challenges LLMs to identify the locations of mines based on numerical clues
provided by adjacent opened cells. Successfully completing this task requires
an understanding of each cell's state, discerning spatial relationships between
the clues and mines, and strategizing actions based on logical deductions drawn
from the arrangement of the cells. Our experiments, including trials with the
advanced GPT-4 model, indicate that while LLMs possess the foundational
abilities required for this task, they struggle to integrate these into a
coherent, multi-step logical reasoning process needed to solve Minesweeper.
These findings highlight the need for further research to understand the nature
of reasoning capabilities in LLMs under similar circumstances, and to explore
pathways towards more sophisticated AI reasoning and planning models.

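Our research introduces a new task -- Minesweeper -- designed to test the reasoning and planning abilities of LLMs on a task presented in an unfamiliar format. Our experiments show that although LLMs possess the basic abilities required for this task, they struggle to integrate them into the coherent, multi-step logical reasoning process needed to solve Minesweeper. These findings underscore the need for further research into the reasoning capabilities of LLMs and into more sophisticated AI reasoning and planning models.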
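
To make the kind of deduction the task calls for concrete, the following is a minimal sketch of the two textbook single-clue Minesweeper rules, applied repeatedly so that one inference can feed the next. It is an illustration only, not the evaluation procedure or board format used in our experiments: the board encoding ('?' for an unopened cell, 'F' for a flagged mine, a digit for an opened clue) and the names `neighbors` and `deduce` are assumptions introduced here.

```python
from typing import List, Set, Tuple

Cell = Tuple[int, int]


def neighbors(board: List[str], r: int, c: int) -> List[Cell]:
    """Return the in-bounds cells adjacent to (r, c), diagonals included."""
    rows, cols = len(board), len(board[0])
    return [
        (rr, cc)
        for rr in range(r - 1, r + 2)
        for cc in range(c - 1, c + 2)
        if (rr, cc) != (r, c) and 0 <= rr < rows and 0 <= cc < cols
    ]


def deduce(board: List[str]) -> Tuple[Set[Cell], Set[Cell]]:
    """Apply two single-clue rules until nothing new can be inferred.

    Assumed encoding (hypothetical): '?' unopened, 'F' flagged mine,
    '0'-'8' opened cell showing its clue.
    Returns (cells inferred to be mines, cells inferred to be safe).
    """
    mines: Set[Cell] = set()
    safe: Set[Cell] = set()
    changed = True
    while changed:
        changed = False
        for r, row in enumerate(board):
            for c, cell in enumerate(row):
                if not cell.isdigit():
                    continue
                clue = int(cell)
                nbrs = neighbors(board, r, c)
                # Mines already accounted for: flagged or inferred earlier.
                known = [n for n in nbrs
                         if board[n[0]][n[1]] == "F" or n in mines]
                # Unopened neighbors whose status is still undecided.
                unknown = [n for n in nbrs
                           if board[n[0]][n[1]] == "?"
                           and n not in mines and n not in safe]
                if not unknown:
                    continue
                remaining = clue - len(known)
                if remaining == len(unknown):
                    # Every undecided neighbor must be a mine.
                    mines.update(unknown)
                    changed = True
                elif remaining == 0:
                    # The clue is already satisfied; the rest are safe.
                    safe.update(unknown)
                    changed = True
    return mines, safe
```

On the one-row board `["1?1?"]`, the first clue forces its only unopened neighbor to be a mine, and that inference in turn satisfies the second clue, so the last cell is safe. Chaining even two such steps is the multi-step integration the abstract refers to; a clue with several undecided neighbors, as in `["?1?"]`, yields no conclusion under these two rules alone.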