Modern language models can imitate complex patterns through few-shot
learning, enabling them to complete challenging tasks without fine-tuning.
However, imitation can also lead models to reproduce inaccuracies or harmful
content if present in the context. We study harmful imitation through the lens
of a model's internal representations, and identify two related phenomena:
overthinking and false induction heads. The first phenomenon, overthinking,
appears when we decode predictions from intermediate layers, given correct vs.
incorrect few-shot demonstrations. At early layers, both demonstrations induce
similar model behavior, but the behavior diverges sharply at some "critical
layer", after which the accuracy given incorrect demonstrations progressively
decreases. The second phenomenon, false induction heads, is a possible
mechanistic cause of overthinking: these are heads in late layers that attend
to and copy false information from previous demonstrations, and whose ablation
reduces overthinking. Beyond scientific understanding, our results suggest that
studying intermediate model computations could be a promising avenue for
understanding and guarding against harmful model behaviors.
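
The layer-wise decoding described above can be illustrated with a small sketch. It is not the authors' code: the hidden states, the two-label unembedding matrix, the `min_gap` threshold, and all function names are hypothetical stand-ins chosen for illustration. Each layer's hidden state is projected onto the label directions, accuracy is computed per layer for the correct-demonstration and incorrect-demonstration conditions, and the "critical layer" is taken here to be the first layer where the two accuracies differ by at least `min_gap`.

```python
from typing import List, Optional

Vector = List[float]


def decode(hidden: Vector, unembed: List[Vector]) -> int:
    """Score each label by a dot product with the hidden state; return the argmax."""
    scores = [sum(h * w for h, w in zip(hidden, row)) for row in unembed]
    return max(range(len(scores)), key=scores.__getitem__)


def layerwise_accuracy(states: List[List[Vector]],
                       unembed: List[Vector],
                       labels: List[int]) -> List[float]:
    """states[i][l] is the hidden state of example i at layer l."""
    n_layers = len(states[0])
    acc = []
    for layer in range(n_layers):
        hits = sum(decode(ex[layer], unembed) == y for ex, y in zip(states, labels))
        acc.append(hits / len(labels))
    return acc


def critical_layer(acc_correct: List[float],
                   acc_incorrect: List[float],
                   min_gap: float = 0.25) -> Optional[int]:
    """First layer where the accuracy gap reaches min_gap (an assumed criterion)."""
    for layer, (a, b) in enumerate(zip(acc_correct, acc_incorrect)):
        if a - b >= min_gap:
            return layer
    return None


# Toy data (invented): two labels, three layers, two examples whose true label is 0.
unembed = [[1.0, 0.0], [0.0, 1.0]]
labels = [0, 0]
states_correct = [
    [[0.6, 0.4], [0.8, 0.2], [0.9, 0.1]],
    [[0.7, 0.3], [0.8, 0.2], [0.9, 0.1]],
]
states_incorrect = [
    [[0.6, 0.4], [0.55, 0.45], [0.2, 0.8]],
    [[0.7, 0.3], [0.6, 0.4], [0.3, 0.7]],
]

acc_correct = layerwise_accuracy(states_correct, unembed, labels)
acc_incorrect = layerwise_accuracy(states_incorrect, unembed, labels)
print(acc_correct, acc_incorrect, critical_layer(acc_correct, acc_incorrect))
```

In this toy setup the two conditions agree at the first two layers and diverge at the last one, which mirrors the qualitative pattern the abstract describes. With a real model, the per-layer hidden states would come from the network itself rather than from hand-written lists.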

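Modern language models can imitate complex patterns through few-shot
learning, but this imitation can also reproduce inaccurate or harmful content.
By analyzing the model's internal representations, the study identifies two
related phenomena: overthinking and false induction heads. Overthinking appears
when predictions are decoded from intermediate layers under correct versus
incorrect few-shot demonstrations. At early layers, both kinds of
demonstrations induce similar model behavior, but after a certain "critical
layer" the accuracy given incorrect demonstrations progressively decreases.
False induction heads are a possible mechanistic cause of overthinking: they
are heads in late layers that attend to and copy false information from
previous demonstrations, and ablating them reduces overthinking. Beyond
scientific understanding, the results suggest that studying intermediate model
computations could be a promising avenue for understanding and guarding against
harmful model behaviors.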

Overthinking the Truth: Understanding how Language Models Process False Demonstrations

Machine learning systems perform well on pattern matching tasks, but their
ability to perform algorithmic or logical reasoning is not well understood. One
important reasoning capability is algorithmic extrapolation, in which models
trained only on small/simple reasoning problems can synthesize complex
strategies for large/complex problems at test time. Algorithmic extrapolation
can be achieved through recurrent systems, which can be iterated many times to
solve difficult reasoning problems. We observe that this approach fails to
scale to highly complex problems because behavior degenerates when many
iterations are applied -- an issue we refer to as "overthinking." We propose a
recall architecture that keeps an explicit copy of the problem instance in
memory so that it cannot be forgotten. We also employ a progressive training
routine that prevents the model from learning behaviors that are specific to
iteration number and instead pushes it to learn behaviors that can be repeated
indefinitely. These innovations prevent the overthinking problem, and enable
recurrent systems to solve extremely hard extrapolation tasks.
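
The idea of recall can be sketched with a toy example. This is only an illustration under stated assumptions: the task (prefix parity over a bit string) is chosen here for convenience, and the hand-written `recall_step` rule stands in for a learned recurrent block. The point it shows is structural: because the problem instance is passed to every iteration, the state converges to a fixed point, and running far more iterations than needed does not degrade the answer.

```python
from typing import List


def recall_step(state: List[int], problem: List[int]) -> List[int]:
    """One iteration of a hand-written stand-in for a learned recurrent block.

    Each position combines its own problem bit with the left neighbor's state.
    The problem is an explicit argument, so it is re-read at every iteration.
    """
    return [problem[i] ^ (state[i - 1] if i > 0 else 0) for i in range(len(problem))]


def iterate(problem: List[int], n_iters: int) -> List[int]:
    """Apply the same step n_iters times, starting from an all-zero state."""
    state = [0] * len(problem)
    for _ in range(n_iters):
        state = recall_step(state, problem)
    return state


def prefix_parity(bits: List[int]) -> List[int]:
    """Reference solution: running XOR of the input bits."""
    out, acc = [], 0
    for b in bits:
        acc ^= b
        out.append(acc)
    return out


short_problem = [1, 0, 1, 1]
long_problem = short_problem * 5  # a larger instance solved by the same rule

print(iterate(short_problem, len(short_problem)))
print(iterate(short_problem, 100) == prefix_parity(short_problem))
print(iterate(long_problem, len(long_problem)) == prefix_parity(long_problem))
```

After k iterations the first k positions are correct, so a longer input simply needs more iterations of the same rule. A step function that saw only the previous state would instead have to carry the problem inside that state, where it could be overwritten.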

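This work proposes a recall architecture that keeps an explicit copy of the
problem instance in memory, combined with a progressive training routine.
These address the degenerate behavior that recurrent systems show when too many
iterations are applied to complex problems, enabling recurrent systems to solve
extremely hard algorithmic extrapolation problems.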

End-to-end Algorithm Synthesis with Recurrent Networks: Logical Extrapolation Without Overthinking

We characterize a prevalent weakness of deep neural networks
(DNNs)---overthinking---which occurs when a DNN can reach correct predictions
before its final layer. Overthinking is computationally wasteful, and it can
also be destructive when, by the final layer, a correct prediction changes into
a misclassification. Understanding overthinking requires studying how each
prediction evolves during a DNN's forward pass, which conventionally is opaque.
For prediction transparency, we propose the Shallow-Deep Network (SDN), a
generic modification to off-the-shelf DNNs that introduces internal
classifiers. We apply SDN to four modern architectures, trained on three image
classification tasks, to characterize the overthinking problem. We show that
SDNs can mitigate the wasteful effect of overthinking with confidence-based
early exits, which reduce the average inference cost by more than 50% and
preserve the accuracy. We also find that the destructive effect occurs for 50%
of misclassifications on natural inputs and that it can be induced,
adversarially, with a recent backdooring attack. To mitigate this effect, we
propose a new confusion metric to quantify the internal disagreements that will
likely lead to misclassifications.
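
A confidence-based early exit can be sketched as follows. This is a minimal illustration, not the SDN implementation: the per-layer probability lists stand in for the outputs of internal classifiers, and the function names and the 0.9 threshold are assumptions made here. The second helper flags the destructive case from the abstract, where some internal classifier predicts the true label but the final layer does not.

```python
from typing import List, Tuple


def argmax(probs: List[float]) -> int:
    """Index of the largest probability (first index on ties)."""
    return probs.index(max(probs))


def early_exit(layer_probs: List[List[float]], threshold: float = 0.9) -> Tuple[int, int]:
    """Return (exit_layer, predicted_class).

    Exits at the first internal classifier whose top probability reaches the
    threshold; otherwise falls through to the final layer.
    """
    for layer, probs in enumerate(layer_probs):
        if max(probs) >= threshold:
            return layer, argmax(probs)
    return len(layer_probs) - 1, argmax(layer_probs[-1])


def is_destructive(layer_probs: List[List[float]], label: int) -> bool:
    """True if an earlier classifier was correct but the final prediction is wrong."""
    preds = [argmax(p) for p in layer_probs]
    return preds[-1] != label and label in preds[:-1]


# Invented outputs of three classifiers (two internal, one final) for one input
# whose true label is 0.
layer_probs = [
    [0.50, 0.30, 0.20],
    [0.95, 0.03, 0.02],
    [0.20, 0.70, 0.10],
]

print(early_exit(layer_probs))                  # exits at the second classifier
print(early_exit(layer_probs, threshold=0.99))  # no exit; uses the final layer
print(is_destructive(layer_probs, label=0))
```

In this example the early exit both skips the last layer and avoids the final-layer misclassification. In practice the threshold trades inference cost against accuracy and would need tuning on held-out data.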

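This work characterizes overthinking in deep neural networks, along with the
wasted computation and misclassifications it causes. It proposes the
Shallow-Deep Network, which adds internal classifiers to make intermediate
predictions visible. Confidence-based early exits reduce the average inference
cost by more than 50% while preserving accuracy. The destructive effect
accounts for 50% of misclassifications on natural inputs, and a new confusion
metric quantifies the internal disagreements likely to lead to
misclassifications.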