Challenges in the automated evaluation of Retrieval-Augmented Generation
(RAG) Question-Answering (QA) systems include hallucination problems in
domain-specific knowledge and the lack of gold standard benchmarks for company
internal tasks. This results in difficulties in evaluating RAG variations, like
RAG-Fusion (RAGF), in the context of a product QA task at Infineon
Technologies. To solve these problems, we propose a comprehensive evaluation
framework, which leverages Large Language Models (LLMs) to generate large
datasets of synthetic queries based on real user queries and in-domain
documents, uses LLM-as-a-judge to rate retrieved documents and answers,
evaluates the quality of answers, and ranks different variants of
Retrieval-Augmented Generation (RAG) agents with RAGElo's automated Elo-based
competition. LLM-as-a-judge rating of a random sample of synthetic queries
shows a moderate, positive correlation with domain expert scoring in relevance,
accuracy, completeness, and precision. While RAGF outperformed RAG in Elo
score, a significance analysis against expert annotations also shows that RAGF
significantly outperforms RAG in completeness, but underperforms in precision.
In addition, Infineon's RAGF assistant demonstrated slightly higher performance
in document relevance based on MRR@5 scores. We find that RAGElo positively
aligns with the preferences of human annotators, though due caution is still
required. Finally, RAGF's approach leads to more complete answers based on
expert annotations and better answers overall based on RAGElo's evaluation
criteria.

针对检索增强生成（RAG）问答系统的自动化评估中存在的领域特定知识虚构问题和公司内部任务缺乏标准基准的挑战，我们提出了一个综合评估框架，利用大型语言模型（LLM）生成基于真实用户查询和领域内文档的大规模合成查询数据集，使用 LLM 作为评判者对检索的文档和答案进行评级，评估答案的质量，并使用 RAGElo 的自动 Elo 竞赛对不同变体的检索增强生成（RAG）代理进行排名。