BriefGPT.xyz
Nov, 2024
Towards Low-bit Communication for Tensor Parallel LLM Inference
Harry Dong, Tyler Johnson, Minsik Cho, Emad Soroush
TL;DR
This work addresses the growing communication cost of serving large language model (LLM) inference. It proposes a new quantization method that reduces the communicated values from 16 bits to 4.2 bits while nearly preserving the original performance. Results show the method retains, on average, about 98.0% and 99.5% of the original performance, indicating significant practical potential.
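The core idea, compressing the activations exchanged between tensor-parallel ranks to a low bit-width before communication, can be sketched as follows. This is a minimal illustration under assumed details, not the paper's actual algorithm: the symmetric 4-bit scheme and the single per-tensor scale are assumptions (the paper's 4.2 bits/value suggests a small amount of side information beyond plain 4-bit codes).

```python
import numpy as np

def quantize_4bit(x: np.ndarray):
    """Symmetric per-tensor 4-bit quantization: map floats to ints in [-8, 7]."""
    scale = np.abs(x).max() / 7.0  # one float scale per tensor (small overhead)
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Simulate one tensor-parallel communication step: a rank quantizes its
# partial activations to 4-bit codes instead of sending 16-bit values,
# and the receiver dequantizes before continuing the forward pass.
rng = np.random.default_rng(0)
activations = rng.standard_normal(1024).astype(np.float32)
q, s = quantize_4bit(activations)
recovered = dequantize_4bit(q, s)
print(f"max abs error: {np.abs(recovered - activations).max():.4f}")
```

Since the scale is chosen so the largest-magnitude value maps exactly to ±7, no value is clipped and the reconstruction error is bounded by half the scale; in practice, handling outlier values separately (rather than letting them inflate the scale) is what keeps accuracy close to the 16-bit baseline.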
Abstract
Tensor parallelism provides an effective way to increase server large language model (LLM) inference efficiency despite adding an additional communication cost. However, as server LLMs continue to scale in size, …