Long-context modeling capabilities have garnered widespread attention,
leading to the emergence of Large Language Models (LLMs) with ultra-long context
windows. Meanwhile, benchmarks for evaluating long-context LLMs are gradually
catching up. However, existing benchmarks employ irrelevant noise texts to
artificially extend the length of test cases, diverging from the real-world
scenarios of long-context applications. To bridge this gap, we propose a novel
long-context benchmark, Loong, aligning with realistic scenarios through
extended multi-document question answering (QA). Unlike typical document QA, in
Loong's test cases every document is relevant to the final answer, so ignoring
any document causes the answer to fail. Furthermore, Loong introduces
four types of tasks with a range of context lengths: Spotlight Locating,
Comparison, Clustering, and Chain of Reasoning, to facilitate a more realistic
and comprehensive evaluation of long-context understanding. Extensive
experiments indicate that existing long-context language models still exhibit
considerable potential for enhancement. Retrieval-augmented generation (RAG)
achieves poor performance, demonstrating that Loong can reliably assess a
model's long-context modeling capabilities.
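
To make the "every document is relevant" property concrete, here is a minimal sketch of a Clustering-style test case. The `Document` fields, the revenue threshold, and the helper names are all hypothetical and are not taken from Loong itself; the point is only that dropping any single document changes the reference answer, which is not true of a test case padded with irrelevant text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass(frozen=True)
class Document:
    # All fields are hypothetical, chosen only for illustration.
    doc_id: str
    company: str
    revenue_musd: float


def cluster_by_revenue(docs: Sequence[Document], threshold: float = 100.0) -> Dict[str, List[str]]:
    """Clustering-style question: group companies by reported revenue."""
    groups: Dict[str, List[str]] = {"above": [], "at_or_below": []}
    for doc in docs:
        key = "above" if doc.revenue_musd > threshold else "at_or_below"
        groups[key].append(doc.company)
    return {label: sorted(names) for label, names in groups.items()}


def every_document_matters(docs: Sequence[Document], solve: Callable) -> bool:
    """True if removing any single document changes the answer."""
    full_answer = solve(docs)
    return all(
        solve([d for d in docs if d.doc_id != skipped.doc_id]) != full_answer
        for skipped in docs
    )


docs = [
    Document("d1", "Acme", 120.0),
    Document("d2", "Bolt", 80.0),
    Document("d3", "Crux", 150.0),
]
reference = cluster_by_revenue(docs)
print(reference)
print(every_document_matters(docs, cluster_by_revenue))
```

A question whose answer lives in a single document (for example, one that only reads `d1`) would fail the `every_document_matters` check, which is the kind of noise-padded case the abstract contrasts Loong with.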

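This paper proposes Loong, a new long-context benchmark that aligns with real-world scenarios through extended multi-document question answering and is used to evaluate models' long-context modeling capabilities.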

Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA

Recently, large language models (LLMs) have shown remarkable capabilities
including understanding context, engaging in logical reasoning, and generating
responses. However, this is achieved at the expense of stringent computational
and memory requirements, hindering their ability to effectively support long
input sequences. This survey provides an inclusive review of the recent
techniques and methods devised to extend the sequence length in LLMs, thereby
enhancing their capacity for long-context understanding. In particular, we
review and categorize a wide range of techniques including architectural
modifications, such as modified positional encoding and altered attention
mechanisms, which are designed to enhance the processing of longer sequences
while avoiding a proportional increase in computational requirements. The
diverse methodologies investigated in this study can be leveraged across
different phases of LLMs, i.e., training, fine-tuning and inference. This
enables LLMs to efficiently process extended sequences. The limitations of the
current methodologies are discussed in the last section, along with
suggestions for future research directions, underscoring the importance of
sequence length in the continued advancement of LLMs.
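
As one illustration of an altered attention mechanism whose cost does not grow in proportion to full attention, the sketch below compares a full causal mask with a sliding-window mask. Sliding-window attention is used here as a representative example; the abstract does not say which specific mechanisms the survey covers, and the function names are hypothetical.

```python
from typing import List


def causal_mask(n: int) -> List[List[bool]]:
    """Full causal attention: token i may attend to every token j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def sliding_window_mask(n: int, window: int) -> List[List[bool]]:
    """Causal sliding window: token i attends only to its last `window` tokens."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [[i - window < j <= i for j in range(n)] for i in range(n)]


def attended_pairs(mask: List[List[bool]]) -> int:
    """Number of (query, key) pairs the mask allows, a proxy for attention cost."""
    return sum(sum(row) for row in mask)


n, window = 8, 3
print(attended_pairs(causal_mask(n)))                  # 36
print(attended_pairs(sliding_window_mask(n, window)))  # 21
```

With a fixed window, the number of allowed pairs grows roughly linearly in the sequence length (about `n * window`), whereas the full causal mask grows quadratically.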

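This survey reviews techniques and methods for extending sequence length, including architectural modifications and altered attention mechanisms, discusses the limitations of current methods along with suggestions for future research directions, and emphasizes the importance of sequence length for the further development of large language models.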

Beyond the Limits: A Survey of Techniques to Extend the Context Length in Large Language Models

Large language models (LLMs), despite their impressive performance in various
language tasks, are typically limited to processing texts within their context-window
size. This limitation has spurred significant research efforts to enhance LLMs'
long-context understanding with high-quality long-sequence benchmarks. However,
prior datasets in this regard suffer from shortcomings, such as short context
length compared to the context window of modern LLMs; outdated documents that
have data leakage problems; and an emphasis on short dependency tasks rather
than long dependency tasks. In this paper, we present LooGLE, a Long Context
Generic Language Evaluation benchmark for LLMs' long context understanding.
LooGLE features relatively new documents post-2022, with over 24,000 tokens per
document and 6,000 newly generated questions spanning diverse domains. Human
annotators meticulously crafted more than 1,100 high-quality question-answer
pairs to meet the long dependency requirements. These pairs underwent thorough
cross-validation, yielding the most precise assessment of LLMs' long dependency
capabilities. The evaluation of eight state-of-the-art LLMs on LooGLE revealed
key findings: (i) commercial models outperformed open-source models; (ii) LLMs
excelled in short dependency tasks like short question-answering and cloze
tasks but struggled with more intricate long dependency tasks; (iii) in-context
learning and chain-of-thought prompting offered only marginal improvements; (iv)
retrieval-based techniques demonstrated substantial benefits for short
question-answering, while strategies for extending context window length had
limited impact on long context understanding. As such, LooGLE not only provides
a systematic and comprehensive evaluation schema on long-context LLMs, but also
sheds light on future development of enhanced models towards "true long-context
understanding".
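
The benefit of retrieval for short question-answering can be pictured with a toy retriever: when the answer sits in a single passage, picking the best-matching passage is enough. The sketch below ranks passages by simple word overlap with the question. This scoring rule, the example passages, and the function names are assumptions made for illustration and do not describe the retrieval methods evaluated on LooGLE.

```python
import re
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric word set of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, passages: List[str], k: int = 1) -> List[str]:
    """Return the k passages sharing the most words with the question."""
    q = tokens(question)
    ranked = sorted(passages, key=lambda p: len(q & tokens(p)), reverse=True)
    return ranked[:k]


passages = [
    "The bridge opened in 2023 after four years of work.",
    "Its main span is 1200 meters long.",
    "The architect was born in a small coastal town.",
]
question = "How long is the main span of the bridge?"
print(retrieve(question, passages, k=1))
```

A question whose evidence is spread across many passages cannot be answered from a single retrieved passage, which is consistent with the abstract reporting retrieval gains for short question-answering specifically.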

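An evaluation of models on LooGLE shows that commercial models outperform open-source models and that LLMs do well on short dependency tasks but struggle with long dependency tasks. It also finds that retrieval-based techniques bring clear benefits on short question-answering tasks, while strategies for extending context window length have limited impact on long-context understanding.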