Recent research suggests that the feed-forward module within Transformers can
be viewed as a collection of key-value memories, where the keys learn to
capture specific patterns from the input based on the training examples. The
values then combine the output from the 'memories' of the keys to generate
predictions about the next token. This leads to an incremental process of
prediction that gradually converges towards the final token choice near the
output layers. This interesting perspective raises questions about how
multilingual models might leverage this mechanism. Specifically, for
autoregressive models trained on two or more languages, do all neurons (across
layers) respond equally to all languages? No! Our hypothesis centers around the
notion that during pretraining, certain model parameters learn strong
language-specific features, while others learn more language-agnostic (shared
across languages) features. To validate this, we conduct experiments utilizing
parallel corpora of two languages that the model was initially pretrained on.
Our findings reveal that the layers closest to the network's input or output
tend to exhibit more language-specific behaviour compared to the layers in the
middle.
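
The abstract does not say how language specificity is measured, so the following is only a minimal sketch of one plausible approach, not the authors' method. It assumes non-negative feed-forward activations have already been recorded for each neuron on parallel sentences in two languages. The `selectivity` score, the `layer_specificity` average, and the toy activations are all hypothetical, introduced purely for illustration.

```python
def mean(xs):
    return sum(xs) / len(xs)


def selectivity(acts_a, acts_b, eps=1e-9):
    """Hypothetical per-neuron score for non-negative activations.

    Close to +1: the neuron fires only on language A.
    Close to -1: the neuron fires only on language B.
    Close to  0: the neuron responds equally to both (language-agnostic).
    """
    ma, mb = mean(acts_a), mean(acts_b)
    return (ma - mb) / (ma + mb + eps)


def layer_specificity(layer_acts_a, layer_acts_b):
    """Average absolute selectivity over all neurons in one layer."""
    scores = [abs(selectivity(a, b)) for a, b in zip(layer_acts_a, layer_acts_b)]
    return mean(scores)


# Toy data (assumed, not from the paper): acts[layer][neuron] is a list of
# activations, one per sentence of a parallel corpus.
acts_lang_a = [
    [[0.9, 1.1], [0.0, 0.0]],  # layer 0: neuron 0 fires on A, neuron 1 is silent
    [[0.5, 0.5], [0.4, 0.6]],  # layer 1: both neurons fire moderately
]
acts_lang_b = [
    [[0.0, 0.0], [1.0, 1.0]],  # layer 0: the pattern is reversed for B
    [[0.5, 0.5], [0.6, 0.4]],  # layer 1: same average response as for A
]

per_layer = [layer_specificity(a, b) for a, b in zip(acts_lang_a, acts_lang_b)]
```

With this toy data, layer 0 scores close to 1 (strongly language-specific) and layer 1 scores close to 0 (language-agnostic). Plotting such a score across all layers of a real model is one way the input/output versus middle-layer contrast described above could be inspected.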

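Building on research showing that the feed-forward modules in Transformers can be viewed as collections of key-value memories, this paper hypothesizes that neurons in multilingual models do not respond equally to different languages, and supports the hypothesis with experiments.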

Unveiling Multilinguality in Transformer Models: Exploring Language Specificity in Feed-Forward Networks

Feed-forward layers constitute two-thirds of a transformer model's
parameters, yet their role in the network remains under-explored. We show that
feed-forward layers in transformer-based language models operate as key-value
memories, where each key correlates with textual patterns in the training
examples, and each value induces a distribution over the output vocabulary. Our
experiments show that the learned patterns are human-interpretable, and that
lower layers tend to capture shallow patterns, while upper layers learn more
semantic ones. The values complement the keys' input patterns by inducing
output distributions that concentrate probability mass on tokens likely to
appear immediately after each pattern, particularly in the upper layers.
Finally, we demonstrate that the output of a feed-forward layer is a
composition of its memories, which is subsequently refined throughout the
model's layers via residual connections to produce the final output
distribution.
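
The key-value view described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's code: biases and layer normalization are omitted, ReLU is assumed as the activation, and the matrices `keys`, `values`, and `output_embeddings` are made-up numbers. Each key scores the input, the scores weight the corresponding values, and a single value can be projected onto the vocabulary to see the distribution it induces.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def relu(x):
    return max(0.0, x)


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ffn_as_memory(x, keys, values):
    """FF(x) = sum_i relu(x . k_i) * v_i (bias terms omitted).

    Returns the layer output and the per-memory coefficients.
    """
    coeffs = [relu(dot(x, k)) for k in keys]
    dim = len(values[0])
    out = [sum(c * v[j] for c, v in zip(coeffs, values)) for j in range(dim)]
    return out, coeffs


def value_distribution(v, output_embeddings):
    """Distribution over the vocabulary induced by one value: softmax(E . v)."""
    return softmax([dot(e, v) for e in output_embeddings])


# Toy sizes (assumed): hidden size 2, three memory cells, vocabulary of three tokens.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
values = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
output_embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # one row per token

x = [1.0, 0.5]
ffn_out, coeffs = ffn_as_memory(x, keys, values)

# The residual connection adds the layer output back onto its input,
# so later layers refine the running representation.
hidden = [xi + oi for xi, oi in zip(x, ffn_out)]

# The distribution that the first value vector places over the vocabulary.
dist_v0 = value_distribution(values[0], output_embeddings)
```

Here the third key does not match the input (its coefficient is zero), so the output is a weighted mix of the first two values only, which is the "composition of memories" reading of the layer.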

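Our experiments show that the feed-forward layers in transformer models operate as key-value memories: each key correlates with textual patterns in the training examples, and each value induces a distribution over the output vocabulary that concentrates on tokens likely to appear immediately after the pattern. The output of a feed-forward layer is a composition of these memories, which is then refined across layers via residual connections.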

Transformer Feed-Forward Layers Are Key-Value Memories

Deep reinforcement learning techniques have demonstrated superior performance
in a wide variety of environments. As improvements in training algorithms
continue at a brisk pace, theoretical and empirical studies of what these
networks learn lag far behind. In this paper, we propose an
interpretable neural network architecture for Q-learning which provides a
global explanation of the model's behavior using key-value memories, attention
and reconstructible embeddings. With a directed exploration strategy, our model
can reach training rewards comparable to the state-of-the-art deep Q-learning
models. However, results suggest that the features extracted by the neural
network are extremely shallow and subsequent testing using out-of-sample
examples shows that the agent can easily overfit to trajectories seen during
training.

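This paper proposes an interpretable neural network architecture for Q-learning that uses key-value memories, attention, and reconstructible embeddings to give a global explanation of the model's behavior. With a directed exploration strategy, the model reaches training rewards comparable to state-of-the-art deep Q-learning models. However, the results suggest that the features the network extracts are extremely shallow, and subsequent testing on out-of-sample examples shows that the agent can easily overfit to trajectories seen during training.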