Transformers, parallel computation, and logarithmic depth

February 2024

Transformers, parallel computation, and logarithmic depth

Clayton Sanford, Daniel Hsu, Matus Telgarsky

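TL;DR: In this paper, we show that a constant number of self-attention layers can efficiently simulate, and be simulated by, a constant number of communication rounds of massively parallel computation. As a consequence, we prove that logarithmic depth suffices for transformers to solve basic computational tasks that several other neural sequence models and sub-quadratic transformer approximations cannot solve efficiently. We thus establish parallelism as a key distinguishing property of transformers.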

Abstract

We show that a constant number of self-attention layers can efficiently simulate, and be simulated by, a constant number of communication rounds of →