Retrieval-Augmented Generation (RAG) is a promising approach for mitigating
the hallucination of large language models (LLMs). However, existing research
lacks rigorous evaluation of the impact of retrieval-augmented generation on
different large language models, which makes it challenging to identify the
potential bottlenecks in the capabilities of RAG for different LLMs. In this
paper, we systematically investigate the impact of Retrieval-Augmented
Generation on large language models. We analyze the performance of different
large language models in 4 fundamental abilities required for RAG, namely
noise robustness, negative rejection, information integration, and
counterfactual robustness. To this end, we establish the Retrieval-Augmented
Generation Benchmark (RGB), a new corpus for RAG evaluation in both English and
Chinese. RGB divides its instances into 4 separate testbeds according to which
of these fundamental abilities is required to resolve each case. We then
evaluate 6 representative LLMs on RGB to diagnose the
challenges of current LLMs when applying RAG. Evaluation reveals that while
LLMs exhibit a certain degree of noise robustness, they still struggle
significantly in terms of negative rejection, information integration, and
dealing with false information. These results indicate that considerable work
remains before RAG can be applied effectively to LLMs.
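
As a rough illustration of how two of the abilities named above (noise robustness and negative rejection) could be measured, the following is a minimal sketch and not the benchmark's actual implementation. The function names, the `noise_ratio` parameter, the substring-match scoring, and the rejection marker string are all assumptions made for this example.

```python
import random


def build_context(positive_docs, noise_docs, noise_ratio, total_docs, seed=0):
    """Mix relevant and irrelevant documents at a given noise ratio (hypothetical helper)."""
    rng = random.Random(seed)
    n_noise = round(total_docs * noise_ratio)
    n_pos = total_docs - n_noise
    docs = rng.sample(noise_docs, n_noise) + rng.sample(positive_docs, n_pos)
    rng.shuffle(docs)  # avoid leaking relevance through document order
    return docs


def accuracy(predictions, answers):
    """Fraction of predictions containing the gold answer (case-insensitive substring match)."""
    hits = sum(ans.lower() in pred.lower() for pred, ans in zip(predictions, answers))
    return hits / len(answers)


def rejection_rate(predictions, rejection_marker="insufficient information"):
    """Fraction of predictions that decline to answer; the marker string is an assumption."""
    hits = sum(rejection_marker in pred.lower() for pred in predictions)
    return hits / len(predictions)
```

In this sketch, noise robustness would be probed by raising `noise_ratio` and tracking `accuracy`, while negative rejection would be probed by setting `noise_ratio` to 1.0 and tracking `rejection_rate`.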

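Through a systematic investigation and evaluation of the impact of Retrieval-Augmented Generation on large language models, this paper finds that large language models face challenges in noise robustness, negative rejection, information integration, and counterfactual robustness, indicating that there is still a long way to go in effectively applying RAG to large language models.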

Benchmarking Large Language Models in Retrieval-Augmented Generation

Recently, one critical issue looms large in the field of recommender systems:
there are no effective benchmarks for rigorous evaluation, which
consequently leads to unreproducible evaluation and unfair comparison. We
therefore conduct studies from the perspectives of practical theory and
experiments, aiming at benchmarking recommendation for rigorous evaluation.
Regarding the theoretical study, a series of hyper-factors affecting
recommendation performance throughout the whole evaluation chain are
systematically summarized and analyzed via an exhaustive review of 141 papers
published at eight top-tier conferences between 2017 and 2020. We then classify them
into model-independent and model-dependent hyper-factors, and different modes
of rigorous evaluation are defined and discussed in depth accordingly. For the
experimental study, we release the DaisyRec 2.0 library by integrating these
hyper-factors to perform rigorous evaluation, whereby a holistic empirical
study is conducted to unveil the impacts of different hyper-factors on
recommendation performance. Supported by the theoretical and experimental
studies, we finally create benchmarks for rigorous evaluation by proposing
standardized procedures and providing the performance of ten state-of-the-art
methods across six evaluation metrics on six datasets as a reference for later studies.
Overall, our work sheds light on the issues in recommendation evaluation,
provides potential solutions for rigorous evaluation, and lays the foundation for
further investigation.
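
To make the notion of an evaluation metric for top-N recommendation concrete, here is a minimal sketch of two commonly used ones, hit rate and NDCG at a cutoff k. The abstract does not say which six metrics are reported, so the choice of these two, the function names, and the binary-relevance assumption are illustrative only and are not taken from the DaisyRec 2.0 library.

```python
import math


def hit_rate_at_k(ranked_items, relevant_items, k):
    """1.0 if any relevant item appears in the top-k of the ranking, else 0.0."""
    return 1.0 if any(item in relevant_items for item in ranked_items[:k]) else 0.0


def ndcg_at_k(ranked_items, relevant_items, k):
    """Normalized discounted cumulative gain with binary relevance (assumed here)."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked_items[:k])
        if item in relevant_items
    )
    # The ideal ranking places all relevant items (up to k) at the top.
    ideal_hits = min(len(relevant_items), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

Averaging either function over all test users gives a single score per model, and the cutoff k is one example of an evaluation setting that has to be fixed for results to be comparable.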

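This study introduces an approach to evaluating recommender systems based on model-independent and model-dependent hyper-factors. Through a comprehensive review of 141 papers published at top-tier conferences from 2017 to 2020, it systematically summarizes and analyzes the hyper-factors that affect recommendation performance, validates them experimentally on 10 recommendation algorithms and 6 datasets, and finally establishes a benchmark as a reference for later studies.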