Jan, 2024
FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design
Haojun Xia, Zhen Zheng, Xiaoxia Wu, Shiyang Chen, Zhewei Yao...
TL;DR
The proposed TC-FPx full-stack GPU kernel design, together with Tensor Core support, provides new end-to-end support for quantized large language model inference (called FP6-LLM), achieving a better trade-off between inference cost and model quality.
Abstract
Six-bit quantization (FP6) can effectively reduce the size of large language models (LLMs) and preserve the model quality consistently across varied applications. However, existing systems do not provide …
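
Since this page only shows the abstract's opening, the sketch below is merely an illustration of what a 6-bit floating-point (FP6) format means at the value level. The E3M2 layout assumed here (1 sign, 3 exponent, 2 mantissa bits, bias 3, no Inf/NaN) is an assumption for illustration, not taken from the paper; the actual FP6-LLM contribution is a Tensor-Core-oriented full-stack GPU kernel design (TC-FPx), not Python code.

```python
# Toy FP6 (assumed E3M2: 1 sign, 3 exponent, 2 mantissa bits, bias 3, no Inf/NaN)
# encoder/decoder, to illustrate 6-bit floating-point quantization at the value level.

def fp6_dequantize(code: int) -> float:
    """Decode a 6-bit E3M2 code back to a Python float."""
    sign = -1.0 if (code >> 5) & 0x1 else 1.0
    exp = (code >> 2) & 0x7      # 3 exponent bits
    man = code & 0x3             # 2 mantissa bits
    if exp == 0:                 # subnormal: no implicit leading 1
        return sign * (man / 4.0) * 2.0 ** (1 - 3)
    return sign * (1.0 + man / 4.0) * 2.0 ** (exp - 3)

def fp6_quantize(x: float) -> int:
    """Round a float to the nearest of the 64 representable FP6 values (brute force)."""
    best_code, best_err = 0, float("inf")
    for code in range(64):
        err = abs(fp6_dequantize(code) - x)
        if err < best_err:
            best_code, best_err = code, err
    return best_code

if __name__ == "__main__":
    # Values outside the FP6 range (|x| > 28 for this layout) saturate to the max code.
    for x in [0.3, -1.6, 5.0, 28.0, 100.0]:
        q = fp6_quantize(x)
        print(f"{x:>7.2f} -> code {q:06b} -> {fp6_dequantize(q):.4f}")
```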