In this paper, we introduce a novel low-latency inference framework for large
language models (LLMs) inference which enables LLMs to perform inferences with
incomplete prompts. By reallocating computational processes to prompt input
phase, we achieve a substantial reduction in latency, thereby significantly
enhancing the interactive experience for users of LLMs. The framework adeptly
manages the visibility of the streaming prompt to the model, allowing it to
infer from incomplete prompts or await additional prompts. Compared with
traditional inference methods that utilize complete prompts, our approach
demonstrates an average reduction of 59% in response latency on the MMLU-Pro
dataset, while maintaining comparable accuracy. Additionally, our framework
facilitates collaborative inference and output across different models. By
employing an LLM for inference and a small language model (SLM) for output, we
achieve an average 68% reduction in response latency, alongside a 5.5%
improvement in accuracy on the MMLU-Pro dataset compared with the SLM baseline.
For long prompts exceeding 20 sentences, the response latency can be reduced by
up to 93%.

本文介绍了一种用于大型语言模型（LLMs）的新型低延迟推断框架，使 LLMs 能够使用不完整的提示进行推断，并通过重新分配计算过程到提示输入阶段，实现了大幅度的延迟降低，从而显著提高用户与 LLMs 的交互体验。该框架灵活地管理模型对流式提示的可见性，允许它从不完整的提示中进行推断或等待附加提示。与使用完整提示的传统推断方法相比，我们的方法在 MMLU-Pro 数据集上表现出平均响应延迟减少 59％，同时保持相当的准确性。此外，我们的框架促进了不同模型之间的协同推断和输出。通过使用 LLM 进行推断和使用小型语言模型（SLM）进行输出，与 SLM 基线相比，我们在 MMLU-Pro 数据集上实现了平均响应延迟减少 68％，准确性提高了 5.5％。对于超过 20 个句子的长提示，响应延迟可以降低高达 93％。