We are interested in understanding how well Transformer language models (TLMs) can perform reasoning tasks when trained on knowledge encoded in the form of natural language. We investigate systematic generalization abilities on an inductive logical reasoning task in natural language, which involves reasoning over relationships between entities grounded in first-order logical proofs. Specifically, we perform soft theorem-proving by leveraging TLMs to generate logical proofs represented in natural language. We systematically test proof generation capabilities, along with inference capabilities leveraging the generated proofs. We observe length-generalization issues in proof generation and inference when evaluated on longer-than-trained sequences. However, we observe TLMs improve their generalization performance after being exposed to longer, exhaustive proofs. In addition, we discover that TLMs are able to generalize better using backward-chaining proofs compared to their forward-chaining counterparts, while they find it easier to generate forward chaining proofs. We observe that models that are not trained to generate proofs are better at generalizing to problems based on longer proofs. This result suggests that Transformers have efficient, yet not interpretable reasoning strategies internally. These results also highlight the systematic generalization issues in TLMs in the context of logical reasoning, and we believe this work will motivate deeper inspection of their underlying reasoning strategies.

研究Transformer语言模型在自然语言中进行基于逻辑推理的任务，探究它们的系统泛化能力，发现其在逆向推理证明方面表现更优，并且发现没有经过证明生成训练的模型更适合处理长证明的问题。研究结果强调了TLM在逻辑推理中的系统泛化行为，并且对其核心推理策略的深入研究提出了启示。

使用Transformer测量神经证明生成中的系统化概括能力