Transformers are slow and memory-hungry on long sequences, since the time and
memory complexity of self-attention are quadratic in sequence length.
Approximate attention methods have attempted to address this problem by trading
off model quality to reduce the compute complexity, but often do not achieve
wall-clock speedup. We argue that a missing principle is making attention
algorithms IO-aware -- accounting for reads and writes between levels of GPU
memory. We propose FlashAttention, an IO-aware exact attention algorithm that
uses tiling to reduce the number of memory reads/writes between GPU high
bandwidth memory (HBM) and GPU on-chip SRAM. We analyze the IO complexity of
FlashAttention, showing that it requires fewer HBM accesses than standard
attention, and is optimal for a range of SRAM sizes. We also extend
FlashAttention to block-sparse attention, yielding an approximate attention
algorithm that is faster than any existing approximate attention method.
FlashAttention trains Transformers faster than existing baselines: 15%
end-to-end wall-clock speedup on BERT-large (seq. length 512) compared to the
MLPerf 1.1 training speed record, 3$\times$ speedup on GPT-2 (seq. length 1K),
and 2.4$\times$ speedup on Long-Range Arena (seq. length 1K-4K). FlashAttention
and block-sparse FlashAttention enable longer context in Transformers, yielding
higher quality models (0.7 better perplexity on GPT-2 and 6.4 points of lift on
long-document classification) and entirely new capabilities: the first
Transformers to achieve better-than-chance performance on the Path-X challenge
(seq. length 16K, 61.4% accuracy) and Path-256 (seq. length 64K, 63.1%
accuracy).
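
The tiling idea described above can be illustrated outside of a GPU. The following is a minimal NumPy sketch, not the authors' CUDA kernel: it processes keys and values one block at a time and never materializes the full $N \times N$ score matrix. The running-max rescaling used to combine blocks (often called online softmax) is an assumption about how the blockwise softmax is merged, since the abstract only mentions tiling. The names `standard_attention`, `tiled_attention`, and `block_size` are hypothetical.

```python
import numpy as np


def standard_attention(q, k, v):
    """Reference attention that materializes the full N x N score matrix."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v


def tiled_attention(q, k, v, block_size=64):
    """Exact attention computed one key/value block at a time.

    Assumption: blocks are merged with a running row maximum and a running
    softmax denominator, so only an N x block_size slice of scores exists
    at any moment.
    """
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, v.shape[-1]))      # unnormalized output accumulator
    row_max = np.full((n, 1), -np.inf)    # running max of scores per query
    row_sum = np.zeros((n, 1))            # running softmax denominator

    for start in range(0, k.shape[0], block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]

        s = (q @ k_blk.T) * scale
        new_max = np.maximum(row_max, s.max(axis=-1, keepdims=True))

        # Rescale what was accumulated under the old maximum.
        correction = np.exp(row_max - new_max)
        p = np.exp(s - new_max)

        row_sum = row_sum * correction + p.sum(axis=-1, keepdims=True)
        out = out * correction + p @ v_blk
        row_max = new_max

    return out / row_sum
```

Because the computation is exact, the two functions should agree up to floating-point error for any block size. The sketch does not model HBM or SRAM traffic, so it shows only the numerical structure of tiling, not the wall-clock speedup.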

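We propose FlashAttention, an IO-aware exact attention algorithm. FlashAttention uses tiling to reduce the number of memory reads and writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM, and it extends to block-sparse attention. FlashAttention speeds up Transformers, enables longer context and higher-quality models, and yields the first Transformers to achieve better-than-chance performance on the Path-X challenge.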

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

We propose a novel type of balanced clustering algorithm to approximate
attention. Attention complexity is reduced from $O(N^2)$ to $O(N \log N)$,
where $N$ is the sequence length. Our algorithm, SMYRF, uses Locality Sensitive
Hashing (LSH) in a novel way by defining new asymmetric transformations and an
adaptive scheme that produces balanced clusters. The biggest advantage of SMYRF
is that it can be used as a drop-in replacement for dense attention layers
without any retraining. In contrast, prior fast attention methods impose
constraints (e.g., queries and keys must share the same vector representations)
and require retraining from scratch. We apply our method to pre-trained
state-of-the-art Natural Language Processing and Computer Vision models and we
report significant memory and speed benefits. Notably, SMYRF-BERT slightly
outperforms BERT on GLUE while using $50\%$ less memory. We also show that
SMYRF can be used interchangeably with dense attention before and after
training. Finally, we use SMYRF to train GANs with attention at high
resolutions. Using a single TPU, we were able to scale attention to
$128 \times 128 = 16$k and $256 \times 256 = 65$k tokens on BigGAN on CelebA-HQ.
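
The general shape of attention over balanced clusters can be sketched as follows. This is a simplified illustration under stated assumptions, not the SMYRF algorithm itself: a single random projection stands in for the LSH step, the asymmetric transformations are omitted, and balance is enforced by sorting the hash values and cutting the sorted order into equal-size groups. The names `dense_attention`, `clustered_attention`, and `cluster_size` are hypothetical.

```python
import numpy as np


def dense_attention(q, k, v):
    """Reference dense softmax attention."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v


def clustered_attention(q, k, v, cluster_size=32, seed=0):
    """Approximate attention restricted to equal-size query/key clusters.

    Assumption: one random projection is used as a stand-in hash, and the
    i-th group of sorted queries attends only to the i-th group of sorted keys.
    """
    n, d = q.shape
    assert k.shape[0] == n and n % cluster_size == 0

    rng = np.random.default_rng(seed)
    direction = rng.standard_normal(d)

    # Sorting costs O(N log N); equal-size cuts keep every cluster balanced.
    q_order = np.argsort(q @ direction)
    k_order = np.argsort(k @ direction)

    out = np.zeros((n, v.shape[-1]))
    for start in range(0, n, cluster_size):
        qi = q_order[start:start + cluster_size]
        ki = k_order[start:start + cluster_size]

        # Softmax attention inside one cluster only.
        s = q[qi] @ k[ki].T / np.sqrt(d)
        p = np.exp(s - s.max(axis=-1, keepdims=True))
        out[qi] = (p / p.sum(axis=-1, keepdims=True)) @ v[ki]

    return out
```

With a fixed `cluster_size`, the per-cluster work grows linearly in $N$ and the sort dominates, which is consistent with the $O(N \log N)$ figure quoted above. Setting `cluster_size` equal to the sequence length recovers dense attention exactly, which gives a convenient sanity check for the sketch.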

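We propose SMYRF, a novel balanced clustering algorithm that uses Locality Sensitive Hashing and a set of new asymmetric transformations to reduce attention complexity from $O(N^2)$ to $O(N \log N)$, and that performs well without retraining.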