BriefGPT.xyz
Jun, 2024
SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices
Ruslan Svirschevski, Avner May, Zhuoming Chen, Beidi Chen, Zhihao Jia...
TL;DR
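Using SpecExec, large language models with more than 50 billion parameters run on consumer GPUs at 4 to 6 tokens per second with 4-bit quantization, or 2 to 3 tokens per second with 16-bit weights.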
Abstract
As large language models gain widespread adoption, running them efficiently becomes crucial. Recent works on LLM inference use speculative decodi