Effective attention modules have played a crucial role in the success of
Transformer-based large language models (LLMs), but the quadratic time and
memory complexities of these attention modules also pose a challenge when
processing long sequences. One potential solution for the long sequence problem
is to utilize distributed clusters to parallelize the computation of attention
modules across multiple devices (e.g., GPUs). However, adopting a distributed
approach inevitably introduces extra memory overheads to store local attention
results and incurs additional communication costs to aggregate local results
into global ones. In this paper, we propose a distributed attention framework
named ``BurstAttention'' to optimize memory access and communication operations
at both the global cluster and local device levels. In our experiments, we
compare BurstAttention with other competitive distributed attention solutions
for long sequence processing. The experimental results under different length
settings demonstrate that BurstAttention offers significant advantages for
processing long sequences compared with these competitive baselines, reducing
40% communication overheads and achieving 2 X speedup during training 32K
sequence length on 8 X A100.

我们提出了一种名为 “BurstAttention” 的分布式注意力框架，通过在全局集群和本地设备级别上优化内存访问和通信操作，相比于竞争的基准线，在处理长序列时减少 40% 的通信开销，训练 32K 序列长度时实现 2 倍加速。