In recent years, the input context sizes of large language models (LLMs) have
increased dramatically. However, existing evaluation methods have not kept
pace, failing to comprehensively assess the efficiency of models in handling
long contexts. To bridge this gap, we introduce the BABILong benchmark,
designed to test language models' ability to reason across facts distributed in
extremely long documents. BABILong includes a diverse set of 20 reasoning
tasks, including fact chaining, simple induction, deduction, counting, and
handling lists/sets. These tasks are challenging on their own, and even more
demanding when the required facts are scattered across long natural text. Our
evaluations show that popular LLMs effectively utilize only 10-20\% of the
context and their performance declines sharply with increased reasoning
complexity. Among alternatives to in-context reasoning, Retrieval-Augmented
Generation methods achieve a modest 60\% accuracy on single-fact question
answering, independent of context length. Among context extension methods, the
highest performance is demonstrated by recurrent memory transformers, enabling
the processing of lengths up to 11 million tokens. The BABILong benchmark is
extendable to any length to support the evaluation of new upcoming models with
increased capabilities, and we provide splits up to 1 million token lengths.

在这项研究中，我们介绍了 BABILong 基准测试，用于评估大型语言模型在处理长上下文时的效率。评估结果表明，目前流行的语言模型仅有效地利用上下文的 10-20％，并且在处理复杂的推理任务时性能急剧下降。在上下文推理的替代方法中，使用检索增强生成方法能够以最高 60％的准确率回答单个事实问题，而与上下文长度无关。对于上下文扩展方法，采用循环记忆变压器可以处理长度达 1100 万个标记。BABILong 基准测试可以扩展到任意长度，以支持评估具有更强能力的新模型，并为 1 百万个标记长度提供了分割。

BABILong: 长篇背景下的 LLMs 极限测试和筛选

BABILong: Testing the Limits of LLMs with Long Context  Reasoning-in-a-Haystack

This paper addresses the challenge of processing long documents using
generative transformer models. To evaluate different approaches, we introduce
BABILong, a new benchmark designed to assess model capabilities in extracting
and processing distributed facts within extensive texts. Our evaluation, which
includes benchmarks for GPT-4 and RAG, reveals that common methods are
effective only for sequences up to $10^4$ elements. In contrast, fine-tuning
GPT-2 with recurrent memory augmentations enables it to handle tasks involving
up to $10^7$ elements. This achievement marks a substantial leap, as it is by
far the longest input processed by any open neural network model to date,
demonstrating a significant improvement in the processing capabilities for long
sequences.

本研究论文通过引入 BABILong 基准来评估模型在提取和处理长文本中分布式事实的能力，发现传统方法只适用于长度为 10^4 的序列，而使用细调 GPT-2 与循环记忆增强可以处理长度为 10^7 元素的任务，这一成就大大提高了长序列处理能力。