Large language models (LLMs) have shown strong arithmetic reasoning
capabilities when prompted with Chain-of-Thought (CoT) prompts. However, we
have only a limited understanding of how LLMs process these prompts. To
demystify this, prior work has primarily focused on ablating different components
of the CoT prompt and empirically observing the resulting change in LLM
performance. Yet why these components matter to LLM reasoning has not been
explored. To fill this gap, we investigate "neuron
activation" as a lens to provide a unified explanation for observations made by
prior work. Specifically, we look into neurons within the feed-forward layers
of LLMs that may have activated their arithmetic reasoning capabilities, using
Llama2 as an example. To facilitate this investigation, we also propose an
approach based on GPT-4 to automatically identify neurons that imply arithmetic
reasoning. Our analyses reveal that the activation of reasoning neurons in
the feed-forward layers of an LLM can explain the importance of various
components in a CoT prompt, and future research can extend this analysis for a
more complete understanding.

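Using neuron activation as a lens, we study arithmetic reasoning in large language models and how the components of a CoT prompt relate to neuron activation, and we propose a GPT-4-based method to automatically identify neurons involved in arithmetic reasoning.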

An Investigation of Neuron Activation as a Unified Lens to Explain Chain-of-Thought Eliciting Arithmetic Reasoning of LLMs

The field of natural language processing has achieved breakthroughs with the
advent of transformers. They have remained state-of-the-art since then, and
there has also been much research on analyzing, interpreting, and evaluating
the attention layers and the underlying embedding space. In addition to the
self-attention layers, the feed-forward layers in the transformer are a
prominent architectural component. From extensive research, we observe that their
role is under-explored. We focus on the latent space, known as the Activation
Space, that consists of the neuron activations from these feed-forward layers.
In this survey paper, we review interpretability methods that examine what
is learned in this activation space. Since there exists only
limited research in this direction, we conduct a detailed examination of each
work and point out potential future directions of research. We hope our work
provides a step towards strengthening activation space analysis.

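This paper surveys interpretability methods in natural language processing, focusing on the activation space of the feed-forward layers in transformers, with the aim of strengthening research in this area.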

Interpretability in Activation Space Analysis of Transformers: A Focused Survey

Feed-forward layers constitute two-thirds of a transformer model's
parameters, yet their role in the network remains under-explored. We show that
feed-forward layers in transformer-based language models operate as key-value
memories, where each key correlates with textual patterns in the training
examples, and each value induces a distribution over the output vocabulary. Our
experiments show that the learned patterns are human-interpretable, and that
lower layers tend to capture shallow patterns, while upper layers learn more
semantic ones. The values complement the keys' input patterns by inducing
output distributions that concentrate probability mass on tokens likely to
appear immediately after each pattern, particularly in the upper layers.
Finally, we demonstrate that the output of a feed-forward layer is a
composition of its memories, which is subsequently refined throughout the
model's layers via residual connections to produce the final output
distribution.
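
As a rough illustration of the key-value reading described above, the sketch below treats a feed-forward layer as FF(x) = sum_i relu(x . k_i) v_i, with bias terms omitted. The toy dimensions, the ReLU activation, the hand-picked weights, and all function names are assumptions made for illustration; this is not the authors' code.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def memory_coefficients(x, keys):
    """One non-negative coefficient per memory cell: how strongly x matches each key."""
    return [max(0.0, dot(x, k)) for k in keys]


def feed_forward(x, keys, values):
    """Layer output as a coefficient-weighted sum (composition) of value vectors."""
    coeffs = memory_coefficients(x, keys)
    dim = len(values[0])
    return [sum(c * v[j] for c, v in zip(coeffs, values)) for j in range(dim)]


def value_distribution(value, output_embeddings):
    """Distribution that a single value vector induces over a toy vocabulary."""
    return softmax([dot(value, e) for e in output_embeddings])


# Hypothetical toy setup: hidden size 2, three memory cells, three vocabulary items.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
values = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
output_embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

x = [0.5, 2.0]
coeffs = memory_coefficients(x, keys)    # the third key does not match x, so its cell is off
ff_out = feed_forward(x, keys, values)   # composition of the active memories
residual_out = [xi + oi for xi, oi in zip(x, ff_out)]  # residual connection adds the input back
```

In this toy setting the second key matches the input most strongly, so the second value vector dominates the layer output; calling `value_distribution(values[1], output_embeddings)` shows which vocabulary item that memory favors.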

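Through experiments, we find that the feed-forward layers in transformer models operate as key-value memories: each key correlates with textual patterns in the training examples, and each value induces a distribution over the output vocabulary that concentrates on tokens likely to appear after each pattern. The feed-forward outputs are then refined through residual connections.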

Transformer Feed-Forward Layers Are Key-Value Memories

State-of-the-art results on neural machine translation often use attentional
sequence-to-sequence models with some form of convolution or recurrence. Vaswani
et al. (2017) propose a new architecture that avoids recurrence and convolution
completely. Instead, it uses only self-attention and feed-forward layers. While
the proposed architecture achieves state-of-the-art results on several machine
translation tasks, it requires a large number of parameters and training
iterations to converge. We propose Weighted Transformer, a Transformer with
modified attention layers, that not only outperforms the baseline network in
BLEU score but also converges 15-40% faster. Specifically, we replace the
multi-head attention with multiple self-attention branches that the model learns
to combine during the training process. Our model improves the state-of-the-art
performance by 0.5 BLEU points on the WMT 2014 English-to-German translation
task and by 0.4 on the English-to-French translation task.
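
The idea of combining several self-attention branches with learned weights can be sketched as follows. The abstract does not say how the weights are parameterized, so the softmax normalization of `branch_logits`, the toy inputs standing in for per-branch learned projections, and all function names are assumptions for illustration only.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_branch(queries, keys, values):
    """Scaled dot-product attention for one branch; inputs are lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        outputs.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)])
    return outputs


def combine_branches(branch_outputs, branch_logits):
    """Weighted sum of branch outputs; weights are softmax(branch_logits) (an assumption)."""
    weights = softmax(branch_logits)
    n_positions = len(branch_outputs[0])
    dim = len(branch_outputs[0][0])
    return [
        [sum(w * branch[t][j] for w, branch in zip(weights, branch_outputs)) for j in range(dim)]
        for t in range(n_positions)
    ]


# Hypothetical toy input: two tokens with two-dimensional embeddings.
tokens = [[1.0, 0.0], [0.0, 1.0]]
swapped_values = [[0.0, 2.0], [2.0, 0.0]]  # stand-in for a second branch's value projection

branch_a = attention_branch(tokens, tokens, tokens)
branch_b = attention_branch(tokens, tokens, swapped_values)

# In training, branch_logits would be learned parameters; they are fixed here by hand.
mixed = combine_branches([branch_a, branch_b], [0.0, 0.0])
```

With equal logits the result is the plain average of the two branches; raising one logit shifts the output toward that branch, which is the behavior a learned combination would exploit.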

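This paper builds on an attention-based architecture for neural machine translation that uses only self-attention and feed-forward layers to avoid recurrence and convolution, but that needs many parameters and training iterations to converge. It proposes the Weighted Transformer, which modifies the attention layers, converges faster while improving BLEU score, and achieves state-of-the-art results on English-to-German and English-to-French translation tasks.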