The Transformer architecture consists of self-attention and feed-forward
networks (FFNs), which can be viewed as key-value memories according to previous
work. However, FFN and traditional memory utilize different activation
functions (i.e., ReLU and Softmax respectively), which makes them not
equivalent. In this paper, we first rebuild the connections between FFN and
key-value memory by conducting extensive studies on ReLU and Softmax, and find
that they are equivalent when a layer normalization module is added to
Softmax. In addition, ReLU outperforms Softmax on both FFN and key-value memory
when the number of value slots is large. We analyze the reasons and then
explore this good property of ReLU on the self-attention network where the
original Softmax activation performs poorly on long input sequences. We then
propose a full ReLU architecture named ReLUFormer which performs better than
the baseline Transformer on long sequence tasks such as document translation.
This paper sheds light on the following points: 1) Softmax and ReLU use
different normalization methods over elements which lead to different variances
of results, and ReLU is good at dealing with a large number of key-value slots;
2) FFN and key-value memory are equivalent, and thus the Transformer can be
viewed as a memory network where FFNs and self-attention networks are both
key-value memories.
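
The key-value memory view described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the shapes, and in particular the placement of layer normalization on the memory output are assumptions made for illustration only.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each row to zero mean and (approximately) unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    # Numerically stable softmax over the last axis (the memory slots).
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def relu_memory(x, keys, values):
    # FFN read as a key-value memory: ReLU(x K^T) V.
    return np.maximum(x @ keys.T, 0.0) @ values

def softmax_memory(x, keys, values, use_layer_norm=True):
    # Traditional key-value memory: Softmax(x K^T) V.
    # Assumption: the extra layer normalization is applied to the output.
    out = softmax(x @ keys.T) @ values
    return layer_norm(out) if use_layer_norm else out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_model = 16
    x = rng.standard_normal((8, d_model))
    for n_slots in (64, 1024):
        keys = rng.standard_normal((n_slots, d_model))
        values = rng.standard_normal((n_slots, d_model))
        relu_out = relu_memory(x, keys, values)
        soft_out = softmax_memory(x, keys, values, use_layer_norm=False)
        print(n_slots, relu_out.std(), soft_out.std())
```

Running the script prints the output standard deviation of both variants for two slot counts, which is one simple way to inspect the variance difference the abstract refers to.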

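This paper studies the architecture of the Transformer model, covering self-attention and feed-forward networks, and re-establishes the relationship between ReLU and Softmax, proposing that Softmax with an additional layer normalization module is equivalent to ReLU. It also finds that ReLU can handle a large number of key-value slots and performs better when input sequences are long, and it proposes a full-ReLU model, ReLUFormer, which performs better on long-sequence tasks such as document translation.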

A Study on ReLU and Softmax in Transformer

Access to external knowledge is essential for many natural language
processing tasks, such as question answering and dialogue. Existing methods
often rely on a parametric model that stores knowledge in its parameters, or
use a retrieval-augmented model that has access to an external knowledge
source. Parametric and retrieval-augmented models have complementary strengths
in terms of computational efficiency and predictive accuracy. To combine the
strengths of both approaches, we propose the Efficient Memory-Augmented
Transformer (EMAT) -- it encodes external knowledge into a key-value memory and
exploits fast maximum inner product search for memory querying. We also
introduce pre-training tasks that allow EMAT to encode informative key-value
representations, and to learn an implicit strategy to integrate multiple memory
slots into the transformer. Experiments on various knowledge-intensive tasks
such as question answering and dialogue datasets show that simply augmenting
parametric models (T5-base) with our method produces more accurate results
(e.g., 25.8 -> 44.3 EM on NQ) while retaining a high throughput (e.g., 1000
queries/s on NQ). Compared to retrieval-augmented models, EMAT runs
substantially faster across the board and produces more accurate results on WoW
and ELI5. Our code and datasets are available at
https://github.com/uclnlp/EMAT.
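
Querying a key-value memory by maximum inner product search can be sketched as follows. This is a brute-force, exact stand-in written in NumPy, not EMAT's code; a system aiming for high throughput would presumably use a dedicated search index instead. The names `mips_top_k` and `read_memory` are hypothetical.

```python
import numpy as np

def mips_top_k(queries, keys, k):
    # Inner product between every query and every memory key.
    scores = queries @ keys.T                      # (n_queries, n_slots)
    # argpartition selects the k largest scores per row without a full sort.
    top = np.argpartition(-scores, k - 1, axis=-1)[:, :k]
    top_scores = np.take_along_axis(scores, top, axis=-1)
    # Order the k hits of each query from highest to lowest score.
    order = np.argsort(-top_scores, axis=-1)
    idx = np.take_along_axis(top, order, axis=-1)
    return idx, np.take_along_axis(top_scores, order, axis=-1)

def read_memory(queries, keys, values, k=4):
    # Return the value vectors of the k best-matching slots per query.
    idx, scores = mips_top_k(queries, keys, k)
    return values[idx], scores                     # (n_queries, k, d_value)
```

The retrieved value vectors would then be passed to the model alongside the query; how they are integrated is learned in EMAT and is not shown here.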

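The paper proposes the Efficient Memory-Augmented Transformer (EMAT), a method that combines parametric and retrieval-augmented models and makes effective use of external knowledge sources to improve the accuracy and computational efficiency of natural language processing tasks. EMAT encodes external knowledge into a key-value memory queried by inner product search, uses pre-training tasks to encode informative key-value representations, and learns an implicit strategy for integrating multiple memory slots into the Transformer. With this design, it achieves more accurate results on many knowledge-intensive tasks.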