Chain-of-Thought (CoT) reasoning could in principle enable a deeper understanding of the internal reasoning of a language model (LM). However, prior work suggests that some LMs answer questions similarly despite changes in their CoT, indicating that those models are not truly using the CoT. We propose a training method that produces CoTs that are sufficient on their own for predicting future text, independent of other context. This method guarantees that if the LM can predict future tokens, then it must have used the CoT to understand its context. We formalize the idea that the truthfulness of a sender to a receiver LM is the degree to which the sender helps the receiver predict its future observations. We then define a "Markovian" LM as one that predicts future text given only a CoT as context. We derive a "Markovian training" procedure by applying our definition of truthfulness to a Markovian LM and optimizing via policy gradient and Proximal Policy Optimization (PPO). We demonstrate the effectiveness of our training algorithm on long-context arithmetic problems, show that the model uses the CoT, and validate that the generated CoT is meaningful and usable by other models.

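Chain-of-thought reasoning can provide a deep understanding of a language model's internal reasoning. We propose a training method that produces chains of thought that are sufficient, independent of other context, for predicting future text; while ensuring that the language model can predict future tokens, it shows that the model used the chain of thought to understand its context. We obtain a "Markovian training" procedure for "Markovian" language models by optimizing via policy gradient and PPO, demonstrate the effectiveness of the training algorithm on long-context arithmetic problems, and validate that the generated chains of thought are meaningful to and usable by other models.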

Markovian Agents for Truthful Language Modeling