Large language models exhibit high-level commonsense reasoning abilities, especially when enhanced with methods such as Chain-of-Thought (CoT). However, we find that these CoT-like methods cause a considerable number of originally correct answers to become wrong, a phenomenon we define as the Toxic CoT problem. To interpret and mitigate this problem, we first use attribution tracing and causal tracing to probe the internal working mechanism of the LLM during CoT reasoning. Through comparisons, we show that the model loses information from the question in its shallow attention layers when generating rationales or answers. Based on these probing findings, we design a novel method called RIDERS (Residual decodIng and sERial-position Swap), which compensates for the model's information deficit from both the decoding and the serial-position perspectives. Extensive experiments on multiple commonsense reasoning benchmarks validate that this method not only substantially reduces Toxic CoT problems (by 23.6%) but also effectively improves the model's overall commonsense reasoning performance (by 5.5%).
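
The Toxic CoT definition above can be made concrete by comparing, question by question, the answer a model gives without CoT against the answer it gives with CoT. The Python sketch below is a hypothetical illustration, not the authors' evaluation code. The function names `toxic_cot_cases` and `toxic_cot_rate`, the exact-match comparison of answer labels, and normalizing by the number of originally correct answers are all assumptions; the abstract does not state how the reported 23.6% reduction is computed.

```python
from typing import Sequence


def toxic_cot_cases(
    gold: Sequence[str],
    direct_preds: Sequence[str],
    cot_preds: Sequence[str],
) -> list[int]:
    """Return indices of questions answered correctly without CoT but wrongly with CoT.

    Assumption: answers are discrete labels (e.g. multiple-choice letters)
    compared by exact match.
    """
    if not (len(gold) == len(direct_preds) == len(cot_preds)):
        raise ValueError("all inputs must have the same length")
    return [
        i
        for i, (g, d, c) in enumerate(zip(gold, direct_preds, cot_preds))
        if d == g and c != g  # originally correct, turned wrong by CoT
    ]


def toxic_cot_rate(
    gold: Sequence[str],
    direct_preds: Sequence[str],
    cot_preds: Sequence[str],
) -> float:
    """Share of originally correct answers that CoT turns wrong.

    Assumption: normalized by the number of answers that were correct
    without CoT; the paper may normalize differently.
    """
    cases = toxic_cot_cases(gold, direct_preds, cot_preds)  # also validates lengths
    originally_correct = sum(d == g for g, d in zip(gold, direct_preds))
    if originally_correct == 0:
        return 0.0
    return len(cases) / originally_correct


if __name__ == "__main__":
    gold = ["A", "B", "C", "D"]
    direct = ["A", "B", "C", "A"]  # three answers correct without CoT
    cot = ["A", "C", "C", "D"]     # CoT breaks question 1 and fixes question 3
    # one toxic case (index 1) out of three originally correct answers
    print(toxic_cot_cases(gold, direct, cot), toxic_cot_rate(gold, direct, cot))
```

Note that question 3 in the usage example, which CoT fixes, does not count toward the Toxic CoT rate; under this sketch's definition only correct-to-wrong flips are counted, so a method could raise overall accuracy while the toxic rate stays unchanged, which is why the abstract reports the two numbers separately.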

Focus on Your Question! Interpreting and Mitigating Toxic CoT Problems in Commonsense Reasoning