BriefGPT.xyz
Nov, 2024
Towards Low-bit Communication for Tensor Parallel LLM Inference
Harry Dong, Tyler Johnson, Minsik Cho, Emad Soroush
TL;DR
This work addresses the growing communication cost of serving large language model (LLM) inference. It proposes a new quantization method that reduces the communicated values from 16 bits to 4.2 bits while nearly preserving the original performance. Results show the method retains, on average, about 98.0% and 99.5% of the original performance, indicating significant practical potential.
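The core idea, compressing the activations exchanged between tensor-parallel ranks to a low bit-width before communication, can be sketched as follows. This is a minimal illustration under assumed details, not the paper's actual algorithm: the symmetric 4-bit scheme and the single per-tensor scale are assumptions (the paper's 4.2 bits/value suggests a small amount of side information beyond plain 4-bit codes).

```python
import numpy as np

def quantize_4bit(x: np.ndarray):
    """Symmetric per-tensor 4-bit quantization: map floats to ints in [-8, 7]."""
    scale = np.abs(x).max() / 7.0  # one float scale per tensor (small overhead)
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Simulate one tensor-parallel communication step: a rank quantizes its
# partial activations to 4-bit codes instead of sending 16-bit values,
# and the receiver dequantizes before continuing the forward pass.
rng = np.random.default_rng(0)
activations = rng.standard_normal(1024).astype(np.float32)
q, s = quantize_4bit(activations)
recovered = dequantize_4bit(q, s)
print(f"max abs error: {np.abs(recovered - activations).max():.4f}")
```

Since the scale is chosen so the largest-magnitude value maps exactly to ±7, no value is clipped and the reconstruction error is bounded by half the scale; in practice, handling outlier values separately (rather than letting them inflate the scale) is what keeps accuracy close to the 16-bit baseline.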
Abstract
Tensor parallelism provides an effective way to increase server large language model (LLM) inference efficiency despite adding an additional communication cost. However, as server LLMs continue to scale in size, …