In large language model (LLM) inference, the output length of a request is typically regarded as not known a priori. Consequently, most LLM serving systems employ a simple first-come-first-served (FCFS) scheduling strategy, which leads to head-of-line (HOL) blocking and reduces throughput and service quality. In this paper, we reexamine this assumption: we show that, although predicting the exact generation length of each request is infeasible, it is possible to predict the relative ranks of output lengths in a batch of requests using learning to rank. This ranking information offers valuable guidance for scheduling requests. Building on this insight, we develop a novel scheduler for LLM inference and serving that approximates the shortest-job-first (SJF) schedule better than existing approaches. We integrate this scheduler with a state-of-the-art LLM serving system and show significant performance improvements in several important applications: 2.8x lower latency in chatbot serving and 6.5x higher throughput in synthetic data generation. Our code is available at https://github.com/hao-ai-lab/vllm-ltr.git

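This work addresses the scheduling problem in large language model (LLM) inference and proposes a new scheduling method based on learning to rank, which resolves the blocking caused by the conventional first-come-first-served (FCFS) policy. It shows that predicting the relative ranks of output lengths within a batch of requests can substantially improve scheduling efficiency, yielding 2.8x lower latency in chatbot serving and 6.5x higher throughput in synthetic data generation.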
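
To make the scheduling argument concrete, the following is a minimal sketch, not the paper's implementation. It assumes a single server that runs requests one at a time, with all requests arriving together, and it ignores continuous batching, preemption, and starvation handling. The request lengths and the predictor scores are hypothetical values chosen for illustration. A lower score means the request is predicted to finish sooner. Only the relative order of the scores matters, which is why a ranking predictor is enough.

```python
from typing import List, Sequence


def mean_completion_time(lengths: Sequence[int], order: Sequence[int]) -> float:
    """Average finish time when requests run one at a time in `order`."""
    clock = 0
    total = 0
    for idx in order:
        clock += lengths[idx]   # request idx finishes at this time
        total += clock
    return total / len(order)


def fcfs_order(n: int) -> List[int]:
    """First-come-first-served: serve in arrival order."""
    return list(range(n))


def rank_order(scores: Sequence[float]) -> List[int]:
    """Serve in ascending score order; ties keep arrival order (stable sort)."""
    return sorted(range(len(scores)), key=lambda i: scores[i])


# Hypothetical output lengths (in tokens) of four requests, in arrival order.
true_lengths = [900, 20, 400, 35]
# Hypothetical ranking scores from an imperfect predictor: the two short
# requests are ranked first, but the two long ones are swapped.
predicted_scores = [0.8, 0.1, 0.9, 0.3]

fcfs = mean_completion_time(true_lengths, fcfs_order(len(true_lengths)))
ranked = mean_completion_time(true_lengths, rank_order(predicted_scores))
sjf = mean_completion_time(true_lengths, rank_order(true_lengths))

print(f"FCFS: {fcfs}, ranked: {ranked}, SJF (oracle): {sjf}")
```

In this toy setting the long first request blocks everything behind it under FCFS, while ordering by the imperfect scores already recovers most of the gap to the oracle SJF schedule, even though no exact length was predicted.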

Efficient LLM Scheduling by Learning to Rank