This paper investigates the logical reasoning capabilities of large language models (LLMs). For a precisely defined yet tractable formulation, we choose the conceptually simple but technically complex task of constructing proofs in Boolean logic. A trained LLM receives as input a set of assumptions and a goal, and produces as output a proof that formally derives the goal from the assumptions. Incorrect proofs are caught by an automated proof checker. A critical obstacle for training is the scarcity of real-world proofs. We propose an efficient, randomized procedure for synthesizing valid proofs and introduce Template Transformation, a data augmentation technique that enhances the model's ability to handle complex logical expressions. The central evaluation question is whether an LLM has indeed learned to reason. We propose tests to measure the reasoning ability of a black-box LLM. By these measures, experiments demonstrate strong reasoning capabilities for assertions with short proofs; these capabilities decline as proof complexity grows. Notably, Template Transformation improves accuracy even for smaller models, suggesting that it is effective across model scales.

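This study examines the logical reasoning capabilities of large language models (LLMs) and proposes a new data augmentation method for training on classical logic proofs. Using synthesized valid proofs and Template Transformation, the study finds that LLMs reason well on short proofs but that this ability declines on complex proofs. Template Transformation significantly improves model accuracy, indicating that it applies broadly across model sizes.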

Can Large Language Models Learn Formal Logic? A Data-Driven Training and Evaluation Framework