The computational challenges of Large Language Model (LLM) inference remain a
significant barrier to their widespread deployment, especially as prompt
lengths continue to increase. Due to the quadratic complexity of the attention
computation, it takes 30 minutes for an 8B LLM to process a prompt of 1M tokens
(i.e., the pre-filling stage) on a single A100 GPU. Existing methods for
speeding up pre-filling often fail to maintain acceptable accuracy or efficiency
when applied to long-context LLMs. To address this gap, we introduce MInference
(Milliontokens Inference), a sparse calculation method designed to accelerate
the pre-filling stage of long-sequence processing. Specifically, we identify
three unique patterns in long-context attention matrices (the A-shape,
Vertical-Slash, and Block-Sparse patterns) that can be leveraged for efficient
sparse computation on GPUs. We
determine the optimal pattern for each attention head offline and dynamically
build sparse indices based on the assigned pattern during inference. With the
pattern and sparse indices, we perform efficient sparse attention calculations
via our optimized GPU kernels to significantly reduce the latency in the
pre-filling stage of long-context LLMs. Our proposed technique can be directly
applied to existing LLMs without any modifications to the pre-training setup or
additional fine-tuning. By evaluating on a wide range of downstream tasks,
including InfiniteBench, RULER, PG-19, and Needle In A Haystack, and models
including LLaMA-3-1M, GLM4-1M, Yi-200K, Phi-3-128K, and Qwen2-128K, we
demonstrate that MInference effectively reduces inference latency by up to 10x
for pre-filling on an A100, while maintaining accuracy. Our code is publicly
available.
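
To make the three patterns concrete, the following Python sketch builds each one as a dense boolean mask over a causal attention matrix. It is an illustration only, not the MInference kernels: the function names, the parameters (`n_sink`, `window`, `vertical_cols`, `slash_offsets`, `kept_blocks`), and the exact shape assumed for each pattern (A-shape as initial tokens plus a local window, Vertical-Slash as a few key columns plus a few diagonals, Block-Sparse as a subset of tiles) are assumptions made for this example. A real implementation would build compact sparse indices and skip the masked entries instead of materializing an n x n mask.

```python
# Illustrative masks only; all names and parameters here are hypothetical.
# mask[i][j] is True when query position i may attend to key position j.

def a_shape_mask(n, n_sink, window):
    """Assumed A-shape: the first n_sink key columns plus a local window."""
    return [[j <= i and (j < n_sink or i - j < window) for j in range(n)]
            for i in range(n)]

def vertical_slash_mask(n, vertical_cols, slash_offsets):
    """Assumed Vertical-Slash: chosen key columns plus diagonals i - j = offset."""
    cols, offsets = set(vertical_cols), set(slash_offsets)
    return [[j <= i and (j in cols or (i - j) in offsets) for j in range(n)]
            for i in range(n)]

def block_sparse_mask(n, block, kept_blocks):
    """Assumed Block-Sparse: chosen (query_block, key_block) tiles, kept causal."""
    kept = set(kept_blocks)
    return [[j <= i and (i // block, j // block) in kept for j in range(n)]
            for i in range(n)]

def density(mask):
    """Fraction of the causal (lower-triangular) entries that the mask keeps."""
    n = len(mask)
    kept = sum(sum(row) for row in mask)
    return kept / (n * (n + 1) // 2)

if __name__ == "__main__":
    n = 8
    patterns = {
        "A-shape": a_shape_mask(n, n_sink=1, window=2),
        "Vertical-Slash": vertical_slash_mask(n, vertical_cols=[2], slash_offsets=[0]),
        "Block-Sparse": block_sparse_mask(n, block=4, kept_blocks=[(0, 0), (1, 1)]),
    }
    for name, mask in patterns.items():
        print(name, "keeps", round(density(mask), 3), "of the causal entries")
        for row in mask:
            print("".join("x" if keep else "." for keep in row))
```

The `density` helper shows why such patterns matter for cost: the fewer causal entries a head keeps, the less of the quadratic attention computation has to be performed for that head.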

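By identifying distinctive patterns in long-context attention matrices (A-shape, Vertical-Slash, and Block-Sparse) and exploiting sparse computation on GPUs, we propose MInference (Milliontokens Inference) to significantly reduce the latency of the pre-filling stage for long-context large language models.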

MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

During inference for transformer-based large language models (LLMs),
prefilling is the computation of the key-value (KV) cache for input tokens in
the prompt prior to autoregressive generation. For longer input prompt lengths,
prefilling will incur a significant overhead on decoding time. In this work, we
highlight the following pitfall of prefilling: for batches containing
highly varying prompt lengths, significant computation is wasted by the standard
practice of padding sequences to the maximum length. As LLMs increasingly
support longer context lengths, potentially up to 10 million tokens, variations
in prompt lengths within a batch become more pronounced. To address this, we
propose Prepacking, a simple yet effective method to optimize prefilling
computation. To avoid redundant computation on pad tokens, prepacking combines
prompts of varying lengths into a sequence and packs multiple sequences into a
compact batch using a bin-packing algorithm. It then modifies the attention
mask and positional encoding to compute multiple prefilled KV-caches for
multiple prompts within a single sequence. On standard curated datasets
containing prompts with varying lengths, we obtain significant speed and
memory efficiency improvements compared to the default padding-based
prefilling computation within Hugging Face across a range of base model
configurations and inference serving scenarios.
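
The packing step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say which bin-packing algorithm is used, so first-fit decreasing stands in for it here, and the restarted position ids and block-diagonal causal mask are one plausible reading of "modifies the attention mask and positional encoding". All function names (`first_fit_decreasing`, `packed_layout`, `padded_vs_packed`) are hypothetical.

```python
# Hypothetical sketch of prepacking; names and algorithm choice are assumptions.

def first_fit_decreasing(lengths, capacity):
    """Greedy bin packing. Returns bins, each a list of prompt indices."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins, free = [], []
    for i in order:
        if lengths[i] > capacity:
            raise ValueError("prompt longer than the packed sequence capacity")
        for b, room in enumerate(free):
            if lengths[i] <= room:      # first bin with enough room wins
                bins[b].append(i)
                free[b] -= lengths[i]
                break
        else:                           # no existing bin fits: open a new one
            bins.append([i])
            free.append(capacity - lengths[i])
    return bins

def packed_layout(prompt_lengths):
    """Layout of one packed sequence holding several prompts back to back.

    Position ids restart at 0 for every prompt, and the mask lets a token
    attend only to earlier tokens of its own prompt (block-diagonal, causal).
    """
    position_ids, segment_ids = [], []
    for segment, n in enumerate(prompt_lengths):
        position_ids.extend(range(n))
        segment_ids.extend([segment] * n)
    total = len(segment_ids)
    mask = [[segment_ids[q] == segment_ids[k] and k <= q for k in range(total)]
            for q in range(total)]
    return position_ids, segment_ids, mask

def padded_vs_packed(lengths, capacity):
    """Token slots processed with padding to the max length vs. with packing."""
    bins = first_fit_decreasing(lengths, capacity)
    return len(lengths) * max(lengths), len(bins) * capacity

if __name__ == "__main__":
    lengths = [7, 2, 3, 5, 1]
    print(first_fit_decreasing(lengths, capacity=7))
    print(padded_vs_packed(lengths, capacity=7))
    position_ids, segment_ids, _ = packed_layout([5, 2])
    print(position_ids, segment_ids)
```

Because each token sees only its own prompt and its own positions, the KV cache computed for a packed prompt should match what that prompt would produce alone, while the batch carries far fewer pad slots.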

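The Prepacking method optimizes prefilling computation for transformer-based large language models: it combines input prompts of different lengths into a single sequence and uses a bin-packing algorithm to pack multiple sequences into a compact batch, reducing redundant computation and improving memory efficiency.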