Current Retrieval-Augmented Generation (RAG) systems concatenate and process numerous retrieved document chunks during prefill, which requires a large volume of computation and therefore leads to significant latency in time-to-first-token (TTFT). To reduce both the computation overhead and TTFT, we introduce TurboRAG, a novel RAG system that redesigns the inference paradigm of current RAG systems by first pre-computing and storing the key-value (KV) caches of documents offline, and then directly retrieving the saved KV caches for prefill. Hence, online computation of the document KV caches is eliminated during inference. In addition, we provide a number of insights into the mask matrix and positional embedding mechanisms, and fine-tune a pretrained language model to maintain the accuracy of TurboRAG. Our approach is applicable to most existing large language models and their applications without requiring any modification of models or inference systems. Experimental results across a suite of RAG benchmarks demonstrate that TurboRAG reduces TTFT by up to 9.4x compared to conventional RAG systems (8.6x on average) while preserving performance comparable to that of standard RAG systems.
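The offline/online split described above can be illustrated with a minimal sketch. It is not the TurboRAG implementation: `ToyEncoder`, `build_cache`, `prefill_conventional`, and `prefill_cached` are hypothetical names, and the toy "KV" values depend only on each token. In a real transformer, a chunk's KV cache also depends on position and on the attention mask, which is presumably why the mask matrix, positional embeddings, and fine-tuning are treated explicitly. The sketch only shows where the online prefill work goes.

```python
from typing import Dict, List, Tuple

# One (key, value) pair per token; a toy stand-in for per-layer KV tensors.
KV = List[Tuple[float, float]]


class ToyEncoder:
    """Hypothetical stand-in for a model's KV computation; counts tokens processed."""

    def __init__(self) -> None:
        self.tokens_processed = 0

    def encode(self, tokens: List[str]) -> KV:
        self.tokens_processed += len(tokens)
        # Assumption: key/value depend only on the token itself (no context, no position).
        return [(float(len(t)), float(sum(map(ord, t)) % 97)) for t in tokens]


def build_cache(encoder: ToyEncoder, chunks: Dict[str, str]) -> Dict[str, KV]:
    """Offline step: compute and store the KV cache of every document chunk."""
    return {cid: encoder.encode(text.split()) for cid, text in chunks.items()}


def prefill_conventional(encoder: ToyEncoder, texts: List[str], query: str) -> KV:
    """Conventional RAG: concatenate retrieved chunks and the query, encode everything online."""
    tokens = [tok for text in texts for tok in text.split()] + query.split()
    return encoder.encode(tokens)


def prefill_cached(encoder: ToyEncoder, cache: Dict[str, KV],
                   chunk_ids: List[str], query: str) -> KV:
    """Cached prefill: load stored chunk KV caches, encode only the query online."""
    kv: KV = []
    for cid in chunk_ids:
        kv.extend(cache[cid])
    kv.extend(encoder.encode(query.split()))
    return kv


chunks = {"d1": "alpha beta gamma", "d2": "delta epsilon"}
query = "what is beta"

# Conventional path: all chunk tokens and query tokens are processed online.
online_a = ToyEncoder()
kv_a = prefill_conventional(online_a, [chunks["d1"], chunks["d2"]], query)
conventional_cost = online_a.tokens_processed

# Cached path: chunk tokens are processed once offline, only the query online.
offline = ToyEncoder()
cache = build_cache(offline, chunks)
online_b = ToyEncoder()
kv_b = prefill_cached(online_b, cache, ["d1", "d2"], query)
cached_cost = online_b.tokens_processed

print(conventional_cost, cached_cost, kv_a == kv_b)
```

In this toy setting the two paths yield identical KV sequences, and the online cost drops from the full prompt length to the query length. The equality holds only because the toy encoder ignores context and position.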

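This work addresses the high computational cost and latency that current retrieval-augmented generation (RAG) systems incur when processing retrieved document chunks. The proposed TurboRAG system precomputes and stores the key-value (KV) caches of documents offline, which eliminates KV cache computation during online inference and significantly reduces time-to-first-token (TTFT) while maintaining model accuracy. Experiments show that TurboRAG reduces TTFT by up to 9.4x, and by 8.6x on average, across multiple benchmarks, with performance comparable to that of conventional RAG systems.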

TurboRAG: Accelerating Retrieval-Augmented Generation with Precomputed KV Caches for Chunked Text