APIServe: Efficient API Support for Large Language Model Inference

Feb 2024

APIServe: Efficient API Support for Large-Language Model Inferencing

Reyna Abhyankar, Zijian He, Vikranth Srivatsa, Hao Zhang, Yiying Zhang

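TL;DR: APIServe is the first inference framework designed for API-augmented LLMs. It reduces the GPU resource waste caused by API calls, improves overall serving throughput by 1.6×, and completes 2× more requests per second than existing LLM inference systems.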

Abstract

Large language models are increasingly integrated with external tools and APIs, such as ChatGPT plugins, to extend their capabilities beyond language-centric tasks. However, today's LLM inference systems are designed fo