Large language models (LLMs) have achieved remarkable performance across various
tasks. However, they face challenges in managing long documents and extended
conversations, due to significantly increased computational requirements, both
in memory and inference time, and potential context truncation when the input
exceeds the LLM's fixed context length. This paper proposes a method called
Selective Context that enhances the inference efficiency of LLMs by identifying
and pruning redundancy in the input context to make the input more compact. We
test our approach using common data sources requiring long context processing:
arXiv papers, news articles, and long conversations, on tasks of summarisation,
question answering, and response generation. Experimental results show that
Selective Context significantly reduces memory cost and decreases generation
latency while maintaining performance comparable to that achieved with the
full context. Specifically, we achieve a 50\% reduction in context
cost, resulting in a 36\% reduction in inference memory usage and a 32\%
reduction in inference time, while observing only a minor drop of 0.023 in
BERTScore and 0.038 in faithfulness on four downstream applications, indicating
that our method strikes a good balance between efficiency and performance.

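Selective Context can significantly improve the inference efficiency of large language models (LLMs), reducing memory usage and inference time. While maintaining comparable performance, it achieves a 50% reduction in context cost, a 36% reduction in inference memory usage, and a 32% reduction in inference time.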

Compressing Context to Enhance Inference Efficiency of Large Language Models

Large language models (LLMs) have received significant attention by achieving
remarkable performance across various tasks. However, their fixed context
length poses challenges when processing long documents or maintaining extended
conversations. This paper proposes a method called \textit{Selective Context}
that employs self-information to filter out less informative content, thereby
making more efficient use of the fixed context length. We demonstrate the
effectiveness of our approach on tasks of summarisation and question answering
across different data sources, including academic papers, news articles, and
conversation transcripts.

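This paper proposes a method called Selective Context, which uses self-information to filter out less informative content, and demonstrates on different data sources that it makes more efficient use of a fixed context length.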
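
The filtering idea can be illustrated with a minimal sketch. Self-information of a token x is I(x) = -log2 P(x), so rarer tokens carry more information. The abstract does not say how P(x) is estimated or at what granularity content is removed, so everything below is an assumption made for illustration: a smoothed unigram frequency model stands in for the language model, filtering is done per token, and the names `self_information`, `selective_filter`, and `keep_ratio` are hypothetical rather than taken from the paper.

```python
import math
from collections import Counter


def self_information(tokens, counts, total):
    """Return -log2 P(token) per token under an add-one smoothed unigram model.

    Assumption: a unigram model is only a stand-in for whatever
    probability estimator the actual method uses.
    """
    vocab = len(counts) + 1  # reserve one slot of probability mass for unseen tokens
    return [-math.log2((counts[t] + 1) / (total + vocab)) for t in tokens]


def selective_filter(tokens, scores, keep_ratio=0.5):
    """Keep the most informative fraction of tokens, preserving their order."""
    n_keep = max(1, round(len(tokens) * keep_ratio))
    # Rank positions by score, highest first (ties keep their original order).
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [tok for i, tok in enumerate(tokens) if i in keep]


# Toy reference text used to estimate token frequencies (illustrative only).
reference = "the model reads the context and the model writes the answer".split()
counts = Counter(reference)
total = len(reference)

context = "the model reads the quantum context".split()
scores = self_information(context, counts, total)
compressed = selective_filter(context, scores, keep_ratio=0.5)
print(compressed)
```

With `keep_ratio=0.5` the sketch drops the frequent tokens ("the", "model") and keeps the rarer ones, which mirrors the 50% context reduction setting reported above in spirit only; the numbers in the abstract come from the authors' own implementation, not from this toy.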