August 2023
SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills
TL;DR: SARATHI improves Large Language Model (LLM) inference performance by combining chunked-prefills with decode-maximal batching, yielding significant throughput gains and reduced pipeline bubbles when used with pipeline parallelism on GPUs.
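To make the two ideas concrete, here is a minimal, hypothetical sketch (not the paper's implementation) of how a scheduler might split a long prefill into fixed-size chunks and "piggyback" pending single-token decodes into each batch up to a token budget. The function name, parameters, and budget model are illustrative assumptions.

```python
# Hypothetical sketch of SARATHI-style hybrid batch construction.
# A prefill of `prompt_len` tokens is split into `chunk_size` chunks; each
# chunk is paired with as many single-token decode requests as fit in the
# remaining token budget ("decode-maximal batching"), so decode work rides
# along with compute-saturating prefill work.

def build_hybrid_batches(prompt_len, chunk_size, decode_queue, token_budget):
    """Return a list of (prefill_chunk_tokens, decode_request_ids) batches."""
    batches = []
    start = 0
    queue = list(decode_queue)
    while start < prompt_len:
        chunk = min(chunk_size, prompt_len - start)
        # Fill the leftover budget with decodes; each decode costs one token.
        slots = max(0, token_budget - chunk)
        decodes, queue = queue[:slots], queue[slots:]
        batches.append((chunk, decodes))
        start += chunk
    return batches

# Example: a 1024-token prompt, 256-token chunks, a 260-token batch budget,
# and 16 waiting decode requests -> 4 batches, each carrying 4 decodes.
batches = build_hybrid_batches(1024, 256, list(range(16)), 260)
```

The point of the sketch is the scheduling shape, not the numbers: because a prefill chunk already saturates GPU compute, the extra decode tokens add little latency while keeping decode requests progressing every iteration.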