Retrieval-augmented generation (RAG) is a promising approach to addressing the limitations of fixed knowledge in large language models (LLMs). However, current benchmarks for evaluating RAG systems suffer from two key deficiencies: (1) they fail to adequately measure LLMs' ability to handle long-context retrieval because existing datasets do not reflect the characteristics of retrieved documents, and (2) they lack a comprehensive method for assessing LLMs' ability to generate long-form responses that effectively exploit retrieved information. To address these shortcomings, we introduce the Long$^2$RAG benchmark and the Key Point Recall (KPR) metric. Long$^2$RAG comprises 280 questions spanning 10 domains and 8 question categories, each associated with 5 retrieved documents with an average length of 2,444 words. KPR evaluates the extent to which LLMs incorporate key points extracted from the retrieved documents into their generated responses, providing a more nuanced assessment of their ability to exploit retrieved information.
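
As an illustration of the idea behind KPR, the following is a minimal sketch for a single question, assuming the key points have already been extracted from the retrieved documents. The function names `key_point_recall` and `toy_judge` are hypothetical, and the word-overlap judge is only a stand-in for whatever coverage check an actual evaluation would use; it is not the benchmark's implementation.

```python
from typing import Callable, Iterable


def key_point_recall(
    response: str,
    key_points: Iterable[str],
    is_covered: Callable[[str, str], bool],
) -> float:
    """Fraction of key points that the response covers (hypothetical helper)."""
    points = list(key_points)
    if not points:
        raise ValueError("at least one key point is required")
    # Count the key points judged to be covered by the response.
    covered = sum(1 for point in points if is_covered(response, point))
    return covered / len(points)


def toy_judge(response: str, key_point: str) -> bool:
    """Assumed stand-in judge: every word of the key point appears in the response."""
    cleaned = response.lower().replace(".", " ").replace(",", " ")
    response_words = set(cleaned.split())
    return all(word in response_words for word in key_point.lower().split())


response = "RAG retrieves documents and feeds them to the model."
key_points = ["retrieves documents", "model", "fine-tuning"]
score = key_point_recall(response, key_points, toy_judge)
print(round(score, 3))
```

Here two of the three key points are covered, so the score is 2/3. Passing the judge in as a parameter keeps the recall computation separate from the coverage decision, which could be swapped for a stronger check without changing the rest of the sketch.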

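This work addresses shortcomings in how current retrieval-augmented generation (RAG) systems are evaluated on long-context processing and long-form generation by proposing the Long²RAG benchmark and the Key Point Recall (KPR) metric. The main findings indicate that the new benchmark and metric effectively measure how large language models use retrieved information during generation, improving the comprehensiveness and precision of evaluation.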

Long²RAG: Evaluating Long-Context and Long-Form Retrieval-Augmented Generation with Key Point Recall