We introduce three new attention mechanisms that outperform standard
multi-head attention in terms of efficiency and learning capabilities, thereby
improving the performance and broader deployability of Transformer models. Our
first contribution is Optimised Attention, which performs similarly to standard
attention, but has 3/4 as many parameters and one matrix multiplication fewer
per head. Next, we introduce Efficient Attention, which performs on par with
standard attention while using only 1/2 as many parameters and two fewer matrix
multiplications per head, and is up to twice as fast as standard
attention. Lastly, we introduce Super Attention, which surpasses standard
attention by a significant margin in both vision and natural language
processing tasks while having fewer parameters and matrix multiplications. In
addition to providing rigorous mathematical comparisons, we evaluate the
presented attention mechanisms on the MNIST, CIFAR100, IMDB Movie Reviews, and
Amazon Reviews datasets.
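
To make the parameter ratios quoted above concrete, the following is a minimal sketch of the underlying arithmetic. It assumes a standard multi-head attention layer with four square d_model x d_model projection matrices (query, key, value, output) and no biases, and it assumes that each reduction corresponds to dropping one or two of those matrices. Which matrices are removed is not stated here, so the function names and the `matrices_removed` parameter are illustrative only.

```python
def standard_attention_params(d_model: int) -> int:
    """Weight count for standard multi-head attention.

    Assumes four d_model x d_model projections (W_Q, W_K, W_V, W_O)
    and ignores bias terms.
    """
    return 4 * d_model * d_model


def reduced_attention_params(d_model: int, matrices_removed: int) -> int:
    """Weight count after dropping `matrices_removed` of the four projections.

    Hypothetical helper: it reproduces only the ratios quoted in the text,
    not the actual layer definitions.
    """
    if not 0 <= matrices_removed <= 4:
        raise ValueError("matrices_removed must be between 0 and 4")
    return (4 - matrices_removed) * d_model * d_model


d_model = 512  # illustrative model width, not taken from the text
standard = standard_attention_params(d_model)
optimised = reduced_attention_params(d_model, matrices_removed=1)
efficient = reduced_attention_params(d_model, matrices_removed=2)

print(optimised / standard)  # 0.75
print(efficient / standard)  # 0.5
```

Under these assumptions the ratios do not depend on the chosen model width, which is consistent with the fractions being stated without reference to a particular model size.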
