BriefGPT.xyz
Mar, 2024
Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra...
TL;DR
Introduces Sarathi-Serve, an efficient LLM inference scheduler that leverages the chunked-prefill technique from Sarathi to build stall-free schedules: new requests can be batched alongside ongoing decodes, improving throughput while keeping the impact on latency minimal.
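The stall-free batching idea described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the token budget, chunk size, and request fields (`remaining_prefill`) are hypothetical names chosen for the example. Each iteration, ongoing decodes are admitted first (one token each), and leftover budget is filled with bounded prefill chunks from newly arrived requests, so decodes are never stalled behind a long prefill.

```python
# Sketch of stall-free batching with chunked prefills, assuming a fixed
# per-iteration token budget (all names here are hypothetical).

def build_batch(decode_reqs, prefill_reqs, token_budget=512, chunk_size=128):
    """Assemble one iteration's batch without stalling ongoing decodes.

    Each decode request contributes exactly 1 token; the remaining
    budget is filled with bounded prefill chunks from new requests.
    """
    batch = [(r, 1) for r in decode_reqs]  # decodes admitted first, never stalled
    budget = token_budget - len(decode_reqs)
    for req in prefill_reqs:
        if budget <= 0:
            break
        # Cap the prefill chunk so one long prompt cannot blow the budget.
        chunk = min(chunk_size, req["remaining_prefill"], budget)
        batch.append((req["id"], chunk))   # piggyback a prefill chunk
        budget -= chunk
    return batch

decodes = ["d0", "d1", "d2"]
prefills = [{"id": "p0", "remaining_prefill": 300},
            {"id": "p1", "remaining_prefill": 700}]
batch = build_batch(decodes, prefills, token_budget=512, chunk_size=128)
```

Because every prefill is split into bounded chunks, the per-iteration latency stays close to that of a pure decode batch, which is the throughput-latency balance the TL;DR refers to.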
Abstract
Each LLM serving request goes through two phases. The first is prefill, which processes the entire input prompt to produce one output token, and the second is →