BriefGPT.xyz
Aug, 2024
Efficient LLM Scheduling by Learning to Rank
Yichao Fu, Siqi Zhu, Runlong Su, Aurick Qiao, Ion Stoica...
TL;DR
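This study addresses the scheduling problem in large language model (LLM) inference and proposes a new scheduling method based on learning to rank, aimed at the blocking caused by the traditional first-come-first-served (FCFS) policy. It shows that predicting the relative ranking of output lengths within a batch of requests can significantly improve scheduling efficiency, yielding 2.8x lower latency for chatbot serving and 6.5x higher throughput for synthetic data generation.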
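The blocking effect of FCFS, and why a relative ranking of output lengths is enough to reduce it, can be illustrated with a toy simulation. This is a minimal sketch under strong assumptions, not the paper's scheduler: requests run one at a time with no batching, a request's cost equals its output length, and `predicted_scores` is a hypothetical stand-in for the output of a learned ranker (only the ordering of the scores matters, not their values). All function names and numbers are invented for illustration.

```python
from typing import List, Sequence


def mean_completion_time(lengths: Sequence[int], order: Sequence[int]) -> float:
    """Average finish time when requests run one at a time in the given order."""
    clock = 0
    total = 0
    for idx in order:
        clock += lengths[idx]  # request idx occupies the server for lengths[idx] units
        total += clock         # its completion time is the clock after it finishes
    return total / len(order)


def fcfs_order(n: int) -> List[int]:
    """First-come-first-served: serve requests in arrival order."""
    return list(range(n))


def ranked_order(scores: Sequence[float]) -> List[int]:
    """Serve the request with the lowest predicted-length score first.

    sorted() is stable, so ties keep their arrival order.
    """
    return sorted(range(len(scores)), key=lambda i: scores[i])


# Hypothetical workload: true output lengths in arrival order.
true_lengths = [900, 20, 35, 400, 15]
# Hypothetical ranker scores: higher means "expected to generate more tokens".
predicted_scores = [0.9, 0.2, 0.3, 0.7, 0.1]

fcfs_latency = mean_completion_time(true_lengths, fcfs_order(len(true_lengths)))
ranked_latency = mean_completion_time(true_lengths, ranked_order(predicted_scores))

print(f"FCFS mean completion time:   {fcfs_latency}")
print(f"Ranked mean completion time: {ranked_latency}")
```

In this toy workload the long first request delays every short request behind it under FCFS, while the ranked order lets the short requests finish first. The scores never need to estimate the true lengths accurately; they only need to order the requests correctly, which is the intuition behind predicting rank rather than exact length.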
Abstract
In Large Language Model (LLM) inference, the output length of an LLM request is typically regarded as not known a priori. Consequently, most LLM serving systems employ a simple First-come-first-serve (FCFS) scheduling ...