Long context capability is a crucial competency for large language models
(LLMs) as it mitigates the human struggle to digest long-form texts. This
capability enables complex task-solving scenarios such as book summarization,
code assistance, and many more tasks that are traditionally manpower-intensive.
However, transformer-based LLMs face significant challenges with long context
input due to the growing size of the KV cache and the intrinsic complexity of
attending to extended inputs. Multiple schools of efficiency-driven
approaches -- such as KV cache quantization, token dropping, prompt
compression, linear-time sequence models, and hybrid architectures -- have been
proposed to produce efficient yet long context-capable models. Despite these
advancements, no existing work has comprehensively benchmarked these methods in
a reasonably aligned environment. In this work, we fill this gap by providing a
taxonomy of current methods and evaluating 10+ state-of-the-art approaches
across seven categories of long context tasks. Our work reveals numerous
previously unknown phenomena and offers insights -- as well as a friendly
workbench -- for the future development of long context-capable LLMs. The
source code will be available at this https URL
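
To make the "growing size of the KV cache" concrete, the following is a minimal back-of-the-envelope sketch, not code from the benchmark itself. It assumes standard multi-head attention, where each token stores one key and one value vector per layer and per KV head. The function name `kv_cache_bytes` and the default configuration (32 layers, 32 KV heads, head dimension 128, 16-bit elements) are hypothetical values chosen for illustration.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128,
                   bytes_per_elem=2, batch_size=1):
    """Estimate KV cache size in bytes for a hypothetical transformer config.

    The leading factor 2 accounts for storing both keys and values.
    All defaults are illustrative assumptions, not values from the paper.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


GIB = 1024 ** 3

# The cache grows linearly with the number of cached tokens.
for n_tokens in (4096, 32768, 131072):
    print(f"{n_tokens:>7} tokens -> {kv_cache_bytes(n_tokens) / GIB:.1f} GiB")

# Storing 4-bit instead of 16-bit elements (ignoring quantization metadata
# such as scales and zero points) cuts the estimate by a factor of four.
print(kv_cache_bytes(131072, bytes_per_elem=0.5) / GIB, "GiB at 4 bits")
```

Under these assumptions the cache costs 0.5 MiB per token, so a 128k-token context needs about 64 GiB at 16-bit precision, which is the kind of pressure that quantization and token dropping aim to relieve.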

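Long context capability is one of the key capabilities of large language models. This study fills a gap in existing work by evaluating more than 10 recent methods on long context tasks, revealing many previously unknown phenomena and offering insights and a workbench for the future development of long context-capable LLMs.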

KV Cache Compression: What Must We Give in Exchange? A Comprehensive Benchmark of Long Context Capable Methods

KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches

Extending large language models to effectively handle long contexts requires
instruction fine-tuning on input sequences of similar length. To address this,
we present LongAlign -- a recipe covering instruction data, training, and
evaluation for long context alignment. First, we construct a long
instruction-following dataset using Self-Instruct. To ensure data
diversity, it covers a broad range of tasks from various long context sources.
Second, we adopt the packing and sorted batching strategies to speed up
supervised fine-tuning on data with varied length distributions. Additionally,
we develop a loss weighting method to balance the contribution to the loss
across different sequences during packing training. Third, we introduce the
LongBench-Chat benchmark for evaluating instruction-following capabilities on
queries of 10k-100k in length. Experiments show that LongAlign outperforms
existing recipes for LLMs in long context tasks by up to 30%, while also
maintaining their proficiency in handling short, generic tasks. The code, data,
and long-aligned models are open-sourced at this https URL
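
The packing and sorted batching strategies named above can be sketched as follows. This is an illustrative toy under stated assumptions, not the LongAlign implementation: `pack_greedy` uses a simple first-fit rule, `sorted_batches` groups sequences of similar length and then shuffles the batch order, and `padding_tokens` is a hypothetical helper that counts wasted padding. All names and the example lengths are invented for this sketch.

```python
import random


def pack_greedy(lengths, max_len):
    """First-fit packing: put each sequence in the first pack with enough room."""
    packs, room = [], []  # packs of sequence indices, and free space per pack
    for idx, n in enumerate(lengths):
        if n > max_len:
            raise ValueError(f"sequence {idx} is longer than max_len")
        for p, free in enumerate(room):
            if n <= free:
                packs[p].append(idx)
                room[p] -= n
                break
        else:  # no existing pack had room, so open a new one
            packs.append([idx])
            room.append(max_len - n)
    return packs


def sorted_batches(lengths, batch_size, seed=0):
    """Group sequences of similar length together, then shuffle the batches."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches = [order[k:k + batch_size] for k in range(0, len(order), batch_size)]
    random.Random(seed).shuffle(batches)
    return batches


def padding_tokens(lengths, batches):
    """Count padding needed when each batch is padded to its longest sequence."""
    return sum(max(lengths[i] for i in b) * len(b) - sum(lengths[i] for i in b)
               for b in batches)


lengths = [100, 900, 120, 880, 500, 480]
naive = [[0, 1], [2, 3], [4, 5]]  # batches in arrival order
print(padding_tokens(lengths, naive))                       # 1580
print(padding_tokens(lengths, sorted_batches(lengths, 2)))  # 60
print(pack_greedy(lengths, max_len=1000))  # [[0, 1], [2, 3], [4, 5]]
```

Both strategies reduce the compute spent on padding when sequence lengths vary widely: sorted batching keeps padding within each batch small, while packing fills a fixed-length window with several sequences.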

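Extending large language models to handle long contexts effectively requires instruction fine-tuning on input sequences of similar length. This paper presents LongAlign, a recipe comprising instruction data, training, and evaluation methods for long context alignment. It builds a dataset covering a variety of long context tasks using Self-Instruct, adopts packing and sorted batching strategies to speed up supervised fine-tuning on data with varied length distributions, introduces a loss weighting method to balance the contribution of different sequences to the loss during packing training, and introduces the LongBench-Chat benchmark to evaluate instruction-following capabilities on queries of 10k-100k in length. Experiments show that LongAlign outperforms existing recipes for LLMs on long context tasks by up to 30%, while maintaining proficiency on short, generic tasks.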
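
One way to read the loss weighting idea for packed training is shown below. This is a hedged sketch of the general principle, not LongAlign's exact formula: it assumes that averaging the loss over all tokens in a pack lets sequences with more tokens dominate, and that averaging per sequence first restores an equal contribution from each sequence. The function names and the toy loss values are hypothetical.

```python
def token_mean_loss(token_losses):
    """Naive reduction: average over every token in the pack."""
    return sum(token_losses) / len(token_losses)


def sequence_balanced_loss(token_losses, seq_ids):
    """Average within each sequence first, then across sequences.

    seq_ids[i] names the original sequence that token i belongs to,
    so every sequence in the pack contributes equally to the result.
    """
    totals, counts = {}, {}
    for loss, sid in zip(token_losses, seq_ids):
        totals[sid] = totals.get(sid, 0.0) + loss
        counts[sid] = counts.get(sid, 0) + 1
    per_sequence = [totals[sid] / counts[sid] for sid in totals]
    return sum(per_sequence) / len(per_sequence)


# A pack holding a short sequence (2 tokens) and a long one (6 tokens).
token_losses = [1.0, 1.0] + [3.0] * 6
seq_ids = [0, 0] + [1] * 6

print(token_mean_loss(token_losses))                  # 2.5, pulled toward the long sequence
print(sequence_balanced_loss(token_losses, seq_ids))  # 2.0, both sequences count equally
```

In a real training loop the same reweighting would be applied to per-token cross-entropy values, typically restricted to target tokens, before the backward pass.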