August 2023

SARATHI: Improving LLM Inference Efficiency with Chunked Prefills and Piggybacked Decodes

TL;DR: SARATHI improves Large Language Model (LLM) inference performance by employing chunked-prefills and decode-maximal batching, resulting in significant throughput improvements and reduced pipeline bubbles when used with pipeline parallelism on GPUs.
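
To make the two ideas concrete, here is a minimal Python sketch of how an iteration-level scheduler could combine them: ongoing decode requests are admitted first (one token each), and the leftover per-iteration token budget is filled with a chunk of a pending prefill, so the memory-bound decodes piggyback on a compute-bound prefill chunk. The `Request` fields, the `build_batch` helper, and the 512-token budget are illustrative assumptions for this sketch, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Request:
    rid: int                 # request id (hypothetical field names)
    prompt_len: int          # total prefill tokens for this request
    prefilled: int = 0       # prefill tokens processed so far

def build_batch(prefill_queue: List[Request],
                decode_queue: List[Request],
                token_budget: int = 512) -> List[Tuple[int, int]]:
    """Return (request_id, num_tokens) pairs for one scheduler iteration."""
    batch, budget = [], token_budget
    # Decode-maximal batching: admit as many ongoing decodes as fit,
    # one token each, before scheduling any prefill work.
    for req in decode_queue:
        if budget == 0:
            break
        batch.append((req.rid, 1))
        budget -= 1
    # Chunked prefill: fill the remaining budget with a chunk of the
    # next pending prefill instead of processing the whole prompt at once.
    if prefill_queue and budget > 0:
        req = prefill_queue[0]
        chunk = min(budget, req.prompt_len - req.prefilled)
        req.prefilled += chunk
        batch.append((req.rid, chunk))
        if req.prefilled == req.prompt_len:
            prefill_queue.pop(0)   # prefill done; request moves to decoding
    return batch

# Example: a 1024-token prompt is prefilled in chunks across iterations
# while two in-flight decode requests keep generating one token per step.
prefills = [Request(rid=0, prompt_len=1024)]
decodes = [Request(rid=1, prompt_len=0), Request(rid=2, prompt_len=0)]
for step in range(3):
    print(step, build_batch(prefills, decodes, token_budget=512))
```

Because every iteration carries roughly the same number of tokens, batches stay uniformly compute-heavy, which is what reduces pipeline bubbles when such batches are streamed through pipeline-parallel GPU stages.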