Long-context modeling capabilities have garnered widespread attention,
leading to the emergence of Large Language Models (LLMs) with ultra-long context
windows. Meanwhile, benchmarks for evaluating long-context LLMs are gradually
catching up. However, existing benchmarks employ irrelevant noise texts to
artificially extend the length of test cases, diverging from the real-world
scenarios of long-context applications. To bridge this gap, we propose a novel
long-context benchmark, Loong, aligning with realistic scenarios through
extended multi-document question answering (QA). Unlike typical document QA, in
Loong's test cases each document is relevant to the final answer, so ignoring any
document leads to an incorrect answer. Furthermore, Loong introduces
four types of tasks with a range of context lengths: Spotlight Locating,
Comparison, Clustering, and Chain of Reasoning, to facilitate a more realistic
and comprehensive evaluation of long-context understanding. Extensive
experiments indicate that existing long-context language models still leave
considerable room for improvement. Retrieval-augmented generation (RAG)
achieves poor performance, demonstrating that Loong can reliably assess a
model's long-context modeling capabilities.

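This paper proposes Loong, a new long-context benchmark that aligns with realistic scenarios through extended multi-document question answering, in order to evaluate models' long-context modeling capabilities.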

Leave No Document Behind: A Long-Context Language Model Benchmark for Extended Multi-Document QA

Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA

Long-context modeling capabilities are important for large language models
(LLMs) in various applications. However, directly training LLMs with long
context windows is insufficient to enhance this capability since some training
samples do not exhibit strong semantic dependencies across long contexts. In
this study, we propose ProLong, a data mining framework that assigns
each training sample a long dependency score, which can be used to rank
and filter samples that are more advantageous for enhancing long-context
modeling abilities in LLM training. Specifically, we first use delta perplexity
scores to measure the Dependency Strength between text segments in a
given document. Then we refine this metric based on the Dependency
Distance of these segments to incorporate spatial relationships across
long contexts. Final results are calibrated with a Dependency
Specificity metric to prevent trivial dependencies introduced by repetitive
patterns. Moreover, a random sampling approach is proposed to optimize the
computational efficiency of ProLong. Comprehensive experiments on multiple
benchmarks indicate that ProLong effectively identifies documents that carry
long dependencies, and that LLMs trained on these documents exhibit significantly
enhanced long-context modeling capabilities.
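
The three ingredients named above (Dependency Strength, Dependency Distance, Dependency Specificity) can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so everything concrete here is an assumption made for illustration: the `Scorer` signature, the linear distance weight, the entropy-based specificity term, and the word-overlap `toy_ppl` that stands in for a real language model. It is not the authors' implementation, and it omits the random sampling step.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical interface: scorer(target, context) returns the perplexity of
# `target` given `context` (an empty string means "no context").
Scorer = Callable[[str, str], float]


def dependency_strength(ppl: Scorer, seg_i: str, seg_j: str) -> float:
    """Delta perplexity: how much seeing seg_i lowers the perplexity of seg_j."""
    return ppl(seg_j, "") - ppl(seg_j, seg_i)


def specificity(strengths: List[float]) -> float:
    """Assumed calibration: 1 minus the normalized entropy of the strengths.

    If a segment depends equally on every earlier segment (typical of
    repetitive text), the value is close to 0; a single peak gives 1.
    """
    total = sum(strengths)
    if total <= 0.0:
        return 0.0
    if len(strengths) < 2:
        return 1.0
    probs = [s / total for s in strengths if s > 0.0]
    entropy = -sum(p * math.log(p) for p in probs)
    return max(0.0, 1.0 - entropy / math.log(len(strengths)))


def long_dependency_score(ppl: Scorer, segments: Sequence[str]) -> float:
    """Average of strength * distance * specificity over all segment pairs."""
    n = len(segments)
    if n < 2:
        return 0.0
    score = 0.0
    for j in range(1, n):
        strengths = [
            max(0.0, dependency_strength(ppl, segments[i], segments[j]))
            for i in range(j)
        ]
        spec = specificity(strengths)
        for i, strength in enumerate(strengths):
            distance = (j - i) / (n - 1)  # assumed linear weight in (0, 1]
            score += strength * distance * spec
    return score / (n * (n - 1) / 2)


def toy_ppl(target: str, context: str) -> float:
    """Toy stand-in for a language model: word overlap lowers 'perplexity'."""
    words = target.split()
    seen = set(context.split())
    overlap = sum(w in seen for w in words) / max(len(words), 1)
    return 10.0 * (1.0 - 0.5 * overlap)


doc_long = ["alice hid the key", "rain fell all day",
            "birds sang at dawn", "alice found the key"]
doc_repeat = ["x y z"] * 4
doc_none = ["a b c", "d e f", "g h i", "j k l"]

score_long = long_dependency_score(toy_ppl, doc_long)
score_repeat = long_dependency_score(toy_ppl, doc_repeat)
score_none = long_dependency_score(toy_ppl, doc_none)
```

With the toy scorer, the document whose last segment depends on its first segment ranks above both the repetitive document and the document with no shared content. A real pipeline would replace `toy_ppl` with perplexities from a language model and then rank or filter training samples by the resulting score.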

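This paper proposes ProLong, a data mining framework that assigns each training sample a long dependency score during LLM training, which is used to rank and filter the samples most beneficial for enhancing long-context modeling. Experimental results show that ProLong effectively identifies documents with long dependencies, and that LLMs trained on these documents show significantly improved long-context modeling capabilities.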

Long Context Is Not Really Long: A Prospector of Long-Dependency Data for Large Language Models

Long Context is Not Long at All: A Prospector of Long-Dependency Data for Large Language Models

Long-context modeling presents a significant challenge for transformer-based
large language models (LLMs) due to the quadratic complexity of the
self-attention mechanism and issues with length extrapolation caused by
pretraining exclusively on short inputs. Existing methods address computational
complexity through techniques such as text chunking, the kernel approach, and
structured attention, and tackle length extrapolation problems through
positional encoding, continued pretraining, and data engineering. These
approaches typically require sequential access to the document,
necessitating reading from the first to the last token. We contend that for
goal-oriented reading of long documents, such sequential access is not
necessary, and a proficiently trained model can learn to omit hundreds of less
pertinent tokens. Inspired by human reading behaviors and existing empirical
observations, we propose random access, a novel reading strategy
that enables transformers to efficiently process long documents without
examining every token. Experimental results from pretraining, fine-tuning, and
inference phases validate the efficacy of our method.
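
The quadratic cost mentioned above, and why skipping tokens helps, can be made concrete with a small arithmetic sketch in Python. It only counts query-key score entries in full self-attention. The 32,000-token document and the 25% keep ratio are illustrative assumptions, not figures from the paper, and the code does not model the authors' random access mechanism itself.

```python
def attention_pairs(num_tokens: int) -> int:
    """Number of query-key score entries in full self-attention."""
    return num_tokens * num_tokens


def kept_tokens(num_tokens: int, keep_ratio: float) -> int:
    """Tokens actually read if only a fraction of the document is examined."""
    return max(1, round(num_tokens * keep_ratio))


doc_tokens = 32_000   # assumed document length
keep_ratio = 0.25     # assumed fraction of tokens the model reads

full_pairs = attention_pairs(doc_tokens)
reduced_pairs = attention_pairs(kept_tokens(doc_tokens, keep_ratio))

print(f"full attention:    {full_pairs:,} score entries")
print(f"reading 25%:       {reduced_pairs:,} score entries")
print(f"reduction factor:  {full_pairs // reduced_pairs}x")
```

Because the cost grows with the square of the number of tokens read, examining a quarter of the tokens cuts the attention score count by a factor of sixteen under these assumptions.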

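Long-context modeling poses a significant challenge for transformer-based large language models (LLMs). We propose a new reading strategy, random access, that lets transformers efficiently skip irrelevant tokens when processing long documents. Experiments in the pretraining, fine-tuning, and inference phases validate the effectiveness of our method.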