We present a comprehensive evaluation of answer quality in
Retrieval-Augmented Generation (RAG) applications using vRAG-Eval, a novel
grading system that is designed to assess correctness, completeness, and
honesty. We further map the grades for these quality aspects to a
binary score, indicating an accept or reject decision, mirroring the intuitive
"thumbs-up" or "thumbs-down" gesture commonly used in chat applications. This
approach suits factual business settings where a clear accept-or-reject decision is
essential. We use two Large Language Models (LLMs) as vRAG-Eval graders to evaluate
the quality of answers generated by a vanilla RAG
application. We compare these evaluations with human expert judgments and find
a substantial alignment between GPT-4's assessments and those of human experts,
reaching 83% agreement on accept or reject decisions. This study highlights the
potential of LLMs as reliable evaluators in closed-domain, closed-ended
settings, particularly when human evaluations require significant resources.
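
The mapping from grades to an accept or reject decision, and the agreement figure computed from it, can be illustrated with a minimal sketch. The 1-5 grade scale, the cut-off `ACCEPT_THRESHOLD`, and the function names `to_binary` and `agreement_rate` are assumptions made here for illustration; the abstract does not specify the scale or the threshold that vRAG-Eval uses.

```python
from typing import Sequence

# Hypothetical cut-off on an assumed 1-5 grade scale; not taken from vRAG-Eval.
ACCEPT_THRESHOLD = 4


def to_binary(grade: int, threshold: int = ACCEPT_THRESHOLD) -> bool:
    """Map a graded score to accept (True) or reject (False)."""
    return grade >= threshold


def agreement_rate(
    llm_grades: Sequence[int],
    human_grades: Sequence[int],
    threshold: int = ACCEPT_THRESHOLD,
) -> float:
    """Fraction of answers on which the LLM and the human expert
    reach the same accept/reject decision."""
    if len(llm_grades) != len(human_grades):
        raise ValueError("grade lists must have the same length")
    if not llm_grades:
        raise ValueError("grade lists must not be empty")
    matches = sum(
        to_binary(llm, threshold) == to_binary(human, threshold)
        for llm, human in zip(llm_grades, human_grades)
    )
    return matches / len(llm_grades)
```

Under these assumptions, two graders who assign different grades to an answer still count as agreeing whenever both grades fall on the same side of the threshold, which is why binary agreement can be higher than exact grade agreement.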
