Synthetic long-context LLM benchmarks (e.g., "needle-in-the-haystack") test
only surface-level retrieval capabilities, but how well can long-context LLMs
retrieve, synthesize, and reason over information across book-length inputs? We
address this question by creating NoCha, a dataset of 1,001 minimally different
pairs of true and false claims about 67 recently published English-language
fiction books, with the claims written by human readers of those books. In contrast to existing
long-context benchmarks, our annotators confirm that the largest share of pairs
in NoCha require global reasoning over the entire book to verify. Our
experiments show that while human readers easily perform this task, it is
enormously challenging for all ten long-context LLMs that we evaluate: no
open-weight model performs above random chance (despite their strong
performance on synthetic benchmarks), while GPT-4o achieves the highest
accuracy at 55.8%. Further analysis reveals that (1) on average, models perform
much better on pairs that require only sentence-level retrieval vs. global
reasoning; (2) model-generated explanations for their decisions are often
inaccurate even for correctly labeled claims; and (3) models perform
substantially worse on speculative fiction books that contain extensive
world-building. The methodology behind NoCha allows the benchmark dataset to
evolve and makes future models easy to analyze.

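By creating the NoCha dataset, we evaluate the ability of long-context LLMs to retrieve, synthesize, and reason over book-length inputs, find that global reasoning remains enormously challenging for them, and propose a methodology that allows the benchmark dataset to evolve and future models to be analyzed.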
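
Because the dataset consists of minimally different pairs of true and false claims, one natural way to score a model is at the pair level. The sketch below is illustrative only: it assumes a pair counts as correct only when the model accepts the true claim and rejects the false one, which the text above does not spell out. The function names and the input format (one tuple of two booleans per pair) are hypothetical.

```python
def pair_accuracy(predictions):
    """Fraction of pairs where BOTH claims are labeled correctly.

    predictions: list of (label_for_true_claim, label_for_false_claim)
    tuples, where each label is the model's True/False verdict.
    Assumption: a pair is correct only if the true claim is judged True
    and the false claim is judged False.
    """
    if not predictions:
        raise ValueError("predictions must not be empty")
    correct = sum(1 for on_true, on_false in predictions
                  if on_true is True and on_false is False)
    return correct / len(predictions)


def claim_accuracy(predictions):
    """Fraction of individual claims labeled correctly, ignoring pairing."""
    if not predictions:
        raise ValueError("predictions must not be empty")
    correct = sum((on_true is True) + (on_false is False)
                  for on_true, on_false in predictions)
    return correct / (2 * len(predictions))


# Hypothetical verdicts for four pairs: only the first pair is fully correct.
example = [(True, False), (True, True), (False, False), (False, True)]
print(pair_accuracy(example))
print(claim_accuracy(example))
```

Under this assumed scoring, a model that labels every claim True gets half of the individual claims right but no pairs, which is why pair-level scoring is stricter than claim-level scoring.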