Large language models have consistently struggled with complex reasoning tasks, such as mathematical problem-solving. Investigating the internal reasoning mechanisms of these models can help us design better model architectures and training strategies, ultimately enhancing their reasoning capabilities. In this study, we examine the matching mechanism that Transformers employ for multi-step reasoning on a constructed dataset. We investigate factors that influence this mechanism and find that small initialization and post-LayerNorm facilitate its formation, thereby enhancing the model's reasoning ability. Moreover, we propose a method to improve the model's reasoning capability by adding orthogonal noise. Finally, we investigate the parallel reasoning mechanism of Transformers and, based on this phenomenon, propose a conjecture on the upper bound of the model's reasoning ability. These insights contribute to a deeper understanding of the reasoning processes in large language models and can guide the design of more effective reasoning architectures and training strategies.

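By studying the matching mechanism in Transformers, we find that small initialization and post-LayerNorm facilitate the formation of the matching mechanism, thereby enhancing the model's reasoning ability. In addition, we improve the model's reasoning ability by adding orthogonal noise and propose a conjecture based on the parallel reasoning mechanism of Transformers. Together, these results deepen the understanding of reasoning processes in large language models and can guide the design of more effective reasoning architectures and training strategies.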

Understanding how Transformers perform multi-step reasoning and matching operations