Recent studies have demonstrated that large language models (LLMs) exhibit significant biases in evaluation tasks, particularly a tendency to rate self-generated content more favorably. However, the extent to which this bias manifests in fact-oriented tasks remains unclear, especially within retrieval-augmented generation (RAG) frameworks, where keyword extraction and factual accuracy take precedence over stylistic elements. Our study addresses this knowledge gap by simulating two critical phases of the RAG framework. In the first phase, we assess the suitability of human-authored versus model-generated passages, emulating the pointwise reranking process. In the second phase, we conduct pairwise reading comprehension tests to simulate the generation process. Contrary to previous findings indicating a self-preference in rating tasks, our results reveal no significant self-preference effect in RAG frameworks. Instead, we observe that factual accuracy significantly influences LLMs' output, even in the absence of prior knowledge. Our research contributes to the ongoing discourse on LLM biases and their implications for RAG-based systems, offering insights that may inform the development of more robust and unbiased LLM systems.
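
As a rough illustration of how a self-preference effect could be quantified in the two phases described above, the following minimal sketch computes a pointwise score gap and a pairwise selection rate. The function names, the `"self"`/`"human"` labels, and the toy inputs are all hypothetical and chosen for illustration; they are not taken from the study and do not represent its data, metrics, or results.

```python
from statistics import mean

# Hypothetical labels: "self" marks a model-generated passage,
# "human" marks a human-authored passage.

def pointwise_gap(scores):
    """Phase 1 sketch: mean suitability score of self-generated passages
    minus that of human-authored passages.

    scores: list of (source, score) pairs. A gap near 0 would suggest
    no self-preference in pointwise reranking.
    """
    self_scores = [s for src, s in scores if src == "self"]
    human_scores = [s for src, s in scores if src == "human"]
    return mean(self_scores) - mean(human_scores)

def pairwise_self_rate(choices):
    """Phase 2 sketch: fraction of pairwise trials in which the model's
    answer followed the self-generated passage.

    choices: list of "self"/"human" labels, one per trial. A rate near
    0.5 would suggest no self-preference in generation.
    """
    return sum(c == "self" for c in choices) / len(choices)

# Toy inputs, invented for illustration only.
toy_scores = [("self", 4), ("human", 4), ("self", 2), ("human", 2)]
toy_choices = ["self", "human", "human", "self"]

gap = pointwise_gap(toy_scores)
rate = pairwise_self_rate(toy_choices)
```

In practice, such summary statistics would need a significance test and controls for factual accuracy before any conclusion about bias could be drawn.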

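This study addresses a knowledge gap in evaluating the biases of large language models (LLMs) within retrieval-augmented generation (RAG) frameworks. By simulating two key phases of RAG, it finds that, contrary to previous results, LLMs show no significant self-preference effect in RAG frameworks; instead, factual accuracy has an important influence on model output. This finding advances the understanding of LLM biases and offers insights for developing more robust LLM systems.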

Evaluating Biases of Large Language Models in Retrieval-Augmented Generation