Jul, 2024
Fast Matrix Multiplications for Lookup Table-Quantized LLMs
Han Guo, William Brandon, Radostin Cholakov, Jonathan Ragan-Kelley, Eric P. Xing...
TL;DR
The FLUTE kernel speeds up large language model inference under weight-only, non-uniform (lookup table) quantization. It restructures the quantized weight matrix offline to minimize bit manipulations during inference, and uses vectorization and duplication of the lookup table to relieve shared-memory bandwidth constraints, making FLUTE 2-4x faster than existing GEMM kernels.
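As a point of reference, below is a minimal PyTorch sketch of the computation that such a fused lookup-table kernel performs: weight codes are dequantized through a small per-group table and then multiplied with the activations. The function name, tensor layout, and group size here are illustrative assumptions, not FLUTE's actual API or kernel implementation, which fuses these steps on the GPU without materializing the dequantized weights.

```python
import torch

def lut_dequant_matmul(x, w_idx, lut, group_size=128):
    """Reference (unfused) version of a lookup-table-quantized matmul.

    x      : (batch, in_features) activations, e.g. fp16
    w_idx  : (out_features, in_features) integer codes (e.g. 0..15 for 4-bit)
    lut    : (out_features, in_features // group_size, 2**bits) table of
             non-uniform quantization levels, one table per weight group
    """
    out_f, in_f = w_idx.shape
    # Map each input column to its quantization group.
    groups = torch.arange(in_f, device=w_idx.device) // group_size  # (in_f,)
    # Gather each weight's value from its group's lookup table (dequantization).
    w = lut[torch.arange(out_f, device=w_idx.device).unsqueeze(1),
            groups.unsqueeze(0),
            w_idx]                                                   # (out_f, in_f)
    # A fused kernel would perform this matmul directly from the packed codes,
    # avoiding a round trip of the dequantized matrix through global memory.
    return x @ w.to(x.dtype).T

if __name__ == "__main__":
    torch.manual_seed(0)
    bits, group_size = 4, 64
    out_f, in_f, batch = 256, 512, 8
    x = torch.randn(batch, in_f, dtype=torch.float16)
    w_idx = torch.randint(0, 2**bits, (out_f, in_f))
    lut = torch.randn(out_f, in_f // group_size, 2**bits, dtype=torch.float16)
    y = lut_dequant_matmul(x, w_idx, lut, group_size)
    print(y.shape)  # torch.Size([8, 256])
```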
Abstract
The deployment of large language models (LLMs) is often constrained by memory bandwidth, where the primary bottleneck is the cost of transferring model parameters from the GPU's global memory to its registers. Wh…