Jul, 2024
MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo...
TL;DR
By identifying the distinctive patterns in long-context attention matrices (A-shape, vertical-slash, and block-sparse) and exploiting sparse computation on GPUs, we propose MInference (Million-token Inference) to significantly reduce the latency of the pre-filling stage of long-context large language models.
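To make the pattern idea concrete, here is a minimal NumPy sketch of one of the three patterns named above, the vertical-slash mask: a few "vertical" columns are attended by every query, while "slash" diagonals trace the slanted lines seen in long-context attention maps. This is purely illustrative; the function name, parameters, and pattern offsets are assumptions for the example and are not the MInference kernel.

```python
import numpy as np

def vertical_slash_mask(n, verticals, slashes):
    """Boolean attention mask (True = keep) for an n x n causal map.

    verticals: column indices attended by all queries (e.g. sink tokens).
    slashes:   diagonal offsets below the main diagonal (0 = local diagonal).
    Illustrative sketch only -- not the paper's implementation.
    """
    m = np.zeros((n, n), dtype=bool)
    m[:, verticals] = True                 # vertical lines
    i = np.arange(n)
    for off in slashes:                    # slash lines below the diagonal
        rows = i[i - off >= 0]
        m[rows, rows - off] = True
    m &= np.tril(np.ones((n, n), dtype=bool))  # enforce causality
    return m

mask = vertical_slash_mask(256, verticals=[0, 1], slashes=[0, 128])
# Fraction of the causal attention matrix actually computed:
density = mask.sum() / np.tril(np.ones((256, 256), dtype=bool)).sum()
```

With such a mask, only the True entries of QK^T need to be computed, which is where the pre-filling speedup would come from.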
Abstract
The computational challenges of large language model (LLM) inference remain a significant barrier to their widespread deployment, especially as prompt lengths continue to increase. Due to the quadratic complexity