This paper presents ServerlessLLM, a locality-enhanced serverless inference system for Large Language Models (LLMs). ServerlessLLM exploits the substantial capacity and bandwidth of the storage and memory devices available on GPU servers, thereby reducing costly remote checkpoint downloads and loading checkpoints efficiently. It does so through three main contributions: (i) fast LLM checkpoint loading via a novel loading-optimized checkpoint format, coupled with an efficient multi-tier checkpoint loading system; (ii) locality-driven LLM inference with live migration, which lets ServerlessLLM allocate servers according to checkpoint locality while preserving the low latency of ongoing LLM inference; and (iii) locality-aware server allocation, which lets ServerlessLLM evaluate the status of each server in a cluster and schedule model startups so as to capitalize on local checkpoint placement. Our comprehensive experiments, which include microbenchmarks and real-world traces, show that ServerlessLLM outperforms state-of-the-art systems by 10-200x in latency when running various LLM inference workloads.
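
To make contribution (iii) concrete, the following is a minimal sketch of locality-aware server selection, not ServerlessLLM's actual scheduler. It assumes that startup time can be approximated as queueing delay plus checkpoint size divided by the bandwidth of the fastest storage tier already holding the checkpoint. The names `Server`, `estimate_startup_s`, and `pick_server`, and all bandwidth figures, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical per-tier load bandwidths in GB/s (illustrative, not measured).
TIER_BANDWIDTH_GBPS = {"dram": 20.0, "ssd": 5.0, "remote": 1.0}


@dataclass
class Server:
    name: str
    checkpoint_tier: str   # fastest tier holding the checkpoint: "dram", "ssd", or "remote"
    queue_delay_s: float   # assumed wait before loading can begin on this server


def estimate_startup_s(server: Server, checkpoint_gb: float) -> float:
    """Estimated startup time: queueing delay plus size / tier bandwidth."""
    bandwidth = TIER_BANDWIDTH_GBPS[server.checkpoint_tier]
    return server.queue_delay_s + checkpoint_gb / bandwidth


def pick_server(servers: List[Server], checkpoint_gb: float) -> Server:
    """Pick the server with the lowest estimated startup time."""
    return min(servers, key=lambda s: estimate_startup_s(s, checkpoint_gb))


servers = [
    Server("a", "remote", 0.0),  # idle, but must download the checkpoint
    Server("b", "ssd", 1.0),     # checkpoint on local SSD, short queue
    Server("c", "dram", 5.0),    # checkpoint in host memory, longer queue
]
best = pick_server(servers, checkpoint_gb=20.0)
print(best.name)
```

Under these assumed numbers, the idle server is not the best choice: a server with a local copy can start sooner even after waiting in a queue, which is the intuition behind scheduling for checkpoint locality.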

ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models