Transformer-based models have emerged as one of the most widely used
architectures for natural language processing, natural language generation, and
image generation. The size of state-of-the-art models has increased
steadily, reaching billions of parameters. These huge models are memory hungry
and incur significant inference latency even on cutting-edge AI accelerators
such as GPUs. Specifically, the time and memory complexity of the attention
operation is quadratic in terms of the total context length, i.e., prompt and
output tokens. Several optimizations, such as key-value tensor caching and
FlashAttention computation, have therefore been proposed to meet the low-latency
demands of applications relying on such large models. However, these techniques
do not cater to the computationally distinct nature of the different phases of
inference.
To that end, we propose LeanAttention, a scalable technique for computing
self-attention for the token-generation phase (decode-phase) of decoder-only
transformer models. LeanAttention enables scaling the attention mechanism
implementation for the challenging case of long context lengths by re-designing
the execution flow for the decode-phase. We identify that the associative
property of online softmax can be treated as a reduction operation, thus
allowing us to parallelize the attention computation over these large context
lengths. We extend the "stream-K" style reduction of tiled calculation to
self-attention to enable parallel computation, resulting in an average of 2.6x
attention execution speedup over FlashAttention-2 and up to 8.33x speedup for
512k context lengths.
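
The reduction view of online softmax described above can be illustrated with a minimal sketch. This is not the authors' kernel: it is plain Python on a single query, and the function names (`partial_attention`, `merge`, `chunked_attention`) and the toy sizes are assumptions made for illustration. Each chunk of the context yields a partial state (running max, sum of exponentials, unnormalized output), and two states are combined by rescaling both to a shared max. Because that combine step is associative, chunks can be processed independently and reduced in any grouping.

```python
import math
import random
from functools import reduce


def partial_attention(q, keys, values):
    """Attend one query to one chunk of the context.

    Returns (m, s, acc): the max score in the chunk, the sum of
    exp(score - m), and the unnormalized weighted sum of the values.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(x - m) for x in scores]
    s = sum(weights)
    dim = len(values[0])
    acc = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return m, s, acc


def merge(a, b):
    """Combine two partial states by rescaling both to a shared max."""
    (m_a, s_a, acc_a), (m_b, s_b, acc_b) = a, b
    m = max(m_a, m_b)
    c_a, c_b = math.exp(m_a - m), math.exp(m_b - m)
    s = c_a * s_a + c_b * s_b
    acc = [c_a * x + c_b * y for x, y in zip(acc_a, acc_b)]
    return m, s, acc


def chunked_attention(q, keys, values, chunk_size):
    """Split the context into chunks, attend to each, then reduce."""
    parts = [
        partial_attention(q, keys[i:i + chunk_size], values[i:i + chunk_size])
        for i in range(0, len(keys), chunk_size)
    ]
    _, s, acc = reduce(merge, parts)
    # Normalize once, after all partial states have been merged.
    return [x / s for x in acc]


# Toy data (hypothetical sizes): one query, a context of 10 tokens.
rng = random.Random(0)
head_dim, context_len = 4, 10
query = [rng.uniform(-1, 1) for _ in range(head_dim)]
keys = [[rng.uniform(-1, 1) for _ in range(head_dim)] for _ in range(context_len)]
values = [[rng.uniform(-1, 1) for _ in range(head_dim)] for _ in range(context_len)]

full = chunked_attention(query, keys, values, context_len)  # one chunk
split = chunked_attention(query, keys, values, 3)           # four chunks
```

Here `full` and `split` agree up to floating-point rounding. In a GPU setting the per-chunk work would presumably be spread across compute units and merged afterwards; the sequential `reduce` only stands in for that step.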

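LeanAttention is a scalable technique for computing self-attention. By redesigning the execution flow of the decode phase, it scales the attention implementation to the challenging case of long context lengths and, through parallel computation, delivers an average 2.6x attention execution speedup and up to 8.33x speedup.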

Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers

Transformer-based large language models (LLMs) are now deployed to hundreds
of millions of users. LLM inference is commonly performed on batches of
sequences that share a prefix, such as few-shot examples or a chatbot system
prompt. Decoding in this large-batch setting can be bottlenecked by the
attention operation, which reads large key-value (KV) caches from memory and
computes inefficient matrix-vector products for every sequence in the batch. In
this work, we introduce Hydragen, a hardware-aware exact implementation of
attention with shared prefixes. Hydragen computes attention over the shared
prefix and unique suffixes separately. This decomposition enables efficient
prefix attention by batching queries together across sequences, reducing
redundant memory reads and enabling the use of hardware-friendly matrix
multiplications. Our method can improve end-to-end LLM throughput by up to 32x
against competitive baselines, with speedup growing with the batch size and
shared prefix length. Hydragen also enables the use of very long shared
contexts: with a high batch size, increasing the prefix length from 1K to 16K
tokens decreases Hydragen throughput by less than 15%, while the throughput of
baselines drops by over 90%. Hydragen generalizes beyond simple prefix-suffix
decomposition and can be applied to tree-based prompt sharing patterns,
allowing us to further reduce inference time on competitive programming
problems by 55%.
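
The prefix-suffix decomposition described above can be sketched as follows. This is an illustrative pure-Python version under stated assumptions, not Hydragen's implementation: the function names (`attention_with_lse`, `combine`, `shared_prefix_attention`) and the toy sizes are hypothetical. Attention over the shared prefix and over each sequence's own suffix is computed separately, each returning its output together with the log-sum-exp of its scores, and the two outputs are then blended with weights derived from those log-sum-exp values. The result matches attention over the concatenated prefix and suffix.

```python
import math
import random


def attention_with_lse(q, keys, values):
    """Softmax attention for one query; also returns log-sum-exp of the scores."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(x - m) for x in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]
    return out, m + math.log(total)


def combine(out_a, lse_a, out_b, lse_b):
    """Blend two attention outputs using their softmax denominators."""
    m = max(lse_a, lse_b)
    w_a, w_b = math.exp(lse_a - m), math.exp(lse_b - m)
    return [(w_a * x + w_b * y) / (w_a + w_b) for x, y in zip(out_a, out_b)]


def shared_prefix_attention(queries, prefix_k, prefix_v, suffix_k, suffix_v):
    """One decode step for a batch of sequences that share a prefix."""
    # Prefix pass: every query reads the same prefix keys and values.
    # An optimized kernel would stack the queries into one matrix here;
    # this sketch simply loops over them.
    prefix_parts = [attention_with_lse(q, prefix_k, prefix_v) for q in queries]
    results = []
    for b, q in enumerate(queries):
        # Suffix pass: each sequence attends to its own unique suffix.
        out_s, lse_s = attention_with_lse(q, suffix_k[b], suffix_v[b])
        out_p, lse_p = prefix_parts[b]
        results.append(combine(out_p, lse_p, out_s, lse_s))
    return results


def rand_rows(rng, n, dim):
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n)]


# Toy data (hypothetical sizes): 3 sequences, prefix of 6 tokens, suffixes of 2.
rng = random.Random(1)
batch, dim, prefix_len, suffix_len = 3, 4, 6, 2
queries = rand_rows(rng, batch, dim)
prefix_k, prefix_v = rand_rows(rng, prefix_len, dim), rand_rows(rng, prefix_len, dim)
suffix_k = [rand_rows(rng, suffix_len, dim) for _ in range(batch)]
suffix_v = [rand_rows(rng, suffix_len, dim) for _ in range(batch)]

decomposed = shared_prefix_attention(queries, prefix_k, prefix_v, suffix_k, suffix_v)
# Reference: ordinary attention over the concatenated prefix + suffix.
reference = [
    attention_with_lse(q, prefix_k + suffix_k[b], prefix_v + suffix_v[b])[0]
    for b, q in enumerate(queries)
]
```

`decomposed` and `reference` agree up to floating-point rounding, which is what makes the decomposition exact rather than approximate. The prefix keys and values are stored once and shared by all queries, which is the property the abstract credits for fewer redundant memory reads.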

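Transformer-based large language models are now deployed to hundreds of millions of users. This paper introduces Hydragen, a hardware-aware exact implementation of attention that computes attention over the shared prefix and the unique suffixes separately. The method can improve end-to-end LLM throughput by up to 32x and enables the use of very long shared contexts.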

Hydragen: High-Throughput LLM Inference with Shared Prefixes

Though early successes of Statistical Machine Translation (SMT) systems are
attributed in part to the explicit modelling of the interaction between any two
source and target units, e.g., alignment, recent Neural Machine Translation
(NMT) systems resort to attention, which partially encodes the interaction
for efficiency. In this paper, we employ Joint Representation that fully
accounts for each possible interaction. We sidestep the inefficiency issue by
refining representations with the proposed efficient attention operation. The
resulting Reformer models offer a new Sequence-to-Sequence modelling paradigm
besides the Encoder-Decoder framework and outperform the Transformer baseline
on both the small-scale IWSLT14 German-English, English-German and IWSLT15
Vietnamese-English tasks and the large-scale NIST12 Chinese-English translation
task by about 1 BLEU point. We also propose a systematic model scaling approach,
allowing the Reformer model to beat the state-of-the-art Transformer in IWSLT14
German-English and NIST12 Chinese-English with about 50% fewer parameters. The
code is publicly available at this https URL

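This work proposes a neural machine translation model based on joint representation. It refines the representations with the proposed efficient attention operation, establishes a new sequence-to-sequence modelling paradigm, and achieves better results on several machine translation tasks. It also proposes a systematic model scaling approach that cuts the model size by 50% while achieving higher translation quality.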