Inference serving for large language models (LLMs) is the key to unleashing
their potential in people's daily lives. However, efficient LLM serving remains
challenging today because the requests are inherently heterogeneous and
unpredictable in terms of resource and latency requirements, as a result of the
diverse applications and the dynamic execution nature of LLMs. Existing systems
are fundamentally limited in handling these characteristics and cause problems
such as severe queuing delays, poor tail latencies, and SLO violations.
We introduce Llumnix, an LLM serving system that reacts to such heterogeneous
and unpredictable requests by runtime rescheduling across multiple model
instances. Similar to context switching across CPU cores in modern operating
systems, Llumnix reschedules requests to improve load balancing and isolation,
mitigate resource fragmentation, and differentiate request priorities and SLOs.
Llumnix implements the rescheduling with an efficient and scalable live
migration mechanism for requests and their in-memory states, and exploits it in
a dynamic scheduling policy that unifies the multiple rescheduling scenarios
elegantly. Our evaluations show that Llumnix improves tail latencies by an
order of magnitude, accelerates high-priority requests by up to 1.5x, and
delivers up to 36% cost savings while achieving similar tail latencies,
compared against state-of-the-art LLM serving systems. Llumnix is publicly
available at this https URL
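
To make the rescheduling idea concrete, here is a minimal sketch of one load-balancing step across model instances. It is a toy illustration, not Llumnix's actual policy or migration mechanism. The `Instance` class, the block-based load metric, the `threshold` parameter, and the "move the smallest request" heuristic are all assumptions introduced for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """A hypothetical model instance with a fixed memory budget."""
    name: str
    capacity: int                                  # total memory blocks (assumed unit)
    requests: dict = field(default_factory=dict)   # request id -> blocks in use

    @property
    def used(self) -> int:
        return sum(self.requests.values())

    @property
    def load(self) -> float:
        return self.used / self.capacity


def rebalance_once(instances, threshold=0.2):
    """Move one request from the most loaded to the least loaded instance.

    Returns (request_id, source_name, destination_name), or None when the
    load gap is below `threshold` or no move would narrow the gap.
    """
    src = max(instances, key=lambda i: i.load)
    dst = min(instances, key=lambda i: i.load)
    gap = src.load - dst.load
    if gap < threshold or not src.requests:
        return None

    # Assumed heuristic: the smallest request has the least state to move.
    rid = min(src.requests, key=src.requests.get)
    blocks = src.requests[rid]
    if blocks > dst.capacity - dst.used:
        return None  # destination cannot hold the request's state

    # Skip the move if it would not narrow the load gap.
    new_src = (src.used - blocks) / src.capacity
    new_dst = (dst.used + blocks) / dst.capacity
    if abs(new_src - new_dst) >= gap:
        return None

    dst.requests[rid] = src.requests.pop(rid)
    return rid, src.name, dst.name


if __name__ == "__main__":
    a = Instance("a", 100, {"r1": 50, "r2": 30, "r3": 10})
    b = Instance("b", 100, {"r4": 10})
    print(rebalance_once([a, b]))
```

A real system would also have to copy each request's in-memory state while the request keeps running, which this sketch reduces to a dictionary move.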

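Llumnix is a serving system for large language models (LLMs) that handles heterogeneous and unpredictable requests by rescheduling them at runtime across multiple model instances, improving tail latencies, accelerating high-priority requests, and reducing cost.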

Llumnix: Dynamic Scheduling for Large Language Model Serving

With modern large language models (LLMs) in ubiquitous use across industries,
inference serving for these models continues to expand. Given the
high compute and memory requirements of modern LLMs, more and more
top-of-the-line GPUs are being deployed to serve these models. Energy
availability has come to the forefront as the biggest challenge for data center
expansion to serve these models. In this paper, we present the trade-offs
that arise when energy efficiency is made the primary goal of LLM serving under
performance SLOs. We show that, depending on the inputs, the model, and the
service-level agreements, the LLM inference provider has several knobs
available for improving energy efficiency. We characterize the impact of
these knobs on latency, throughput, and energy. By
exploring these trade-offs, we offer valuable insights into optimizing energy
usage without compromising on performance, thereby paving the way for
sustainable and cost-effective LLM deployment in data center environments.
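
The trade-off described above can be framed as a small selection problem: among candidate serving configurations, choose the one with the lowest energy per token that still meets a latency SLO. The sketch below is illustrative only. The `Config` fields and the `pick_config` helper are hypothetical names, the numbers in the usage example are made up rather than taken from the paper, and the abstract does not say which knobs the paper studies. The one relation used is that energy per token in joules equals average power in watts divided by throughput in tokens per second.

```python
from typing import NamedTuple, Optional, Sequence


class Config(NamedTuple):
    """One hypothetical knob setting with its measured behaviour."""
    label: str
    latency_s: float          # request latency under this setting
    throughput_tok_s: float   # tokens generated per second
    power_w: float            # average power draw in watts


def energy_per_token(c: Config) -> float:
    """Joules per token: watts divided by tokens per second."""
    return c.power_w / c.throughput_tok_s


def pick_config(configs: Sequence[Config], latency_slo_s: float) -> Optional[Config]:
    """Return the most energy-efficient config that meets the latency SLO."""
    feasible = [c for c in configs if c.latency_s <= latency_slo_s]
    return min(feasible, key=energy_per_token, default=None)


if __name__ == "__main__":
    # Made-up measurements for illustration only.
    candidates = [
        Config("A", latency_s=1.0, throughput_tok_s=1000.0, power_w=400.0),
        Config("B", latency_s=1.5, throughput_tok_s=900.0, power_w=270.0),
        Config("C", latency_s=3.0, throughput_tok_s=500.0, power_w=100.0),
    ]
    print(pick_config(candidates, latency_slo_s=2.0))
```

With these made-up numbers, a looser SLO admits a slower but more efficient setting, and an SLO that no setting meets returns `None`.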

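This paper studies the energy-efficiency trade-offs in inference serving for large language models (LLMs). By exploring the balance among latency, throughput, and energy, it offers insights for optimizing energy use, paving the way for sustainable and cost-effective LLM deployment in data center environments.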