We propose a new method to measure the task-specific accuracy of
Retrieval-Augmented Large Language Models (RAG). Evaluation is performed by
scoring the RAG on an automatically generated synthetic exam composed of
multiple choice questions based on the corpus of documents associated with the
task. Our method is an automated, cost-efficient, interpretable, and robust
strategy to select the optimal components for a RAG system. We leverage Item
Response Theory (IRT) to estimate the quality of an exam and its
informativeness on task-specific accuracy. IRT also provides a natural way to
iteratively improve the exam by eliminating the exam questions that are not
sufficiently informative about a model's ability. We demonstrate our approach
on four new open-ended Question-Answering tasks based on arXiv abstracts,
StackExchange questions, AWS DevOps troubleshooting guides, and SEC filings. In
addition, our experiments reveal more general insights into factors impacting
RAG performance, such as model size, retrieval mechanism, prompting, and
fine-tuning. Most notably, our findings show that choosing the right retrieval
algorithm often leads to bigger performance gains than simply using a larger
language model.
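
The idea of dropping exam questions that carry little information about a model's ability can be illustrated with a minimal sketch. It assumes the standard three-parameter logistic (3PL) IRT model and its Fisher item information; the abstract does not state which IRT parameterization or pruning rule is used, so `prune_exam`, the ability grid, the `min_info` threshold, and the item parameters below are all hypothetical choices made for illustration.

```python
import math

def p_correct(theta, a, b, c):
    """3PL probability that a model of ability theta answers an item correctly.

    a: discrimination, b: difficulty, c: guessing floor (assumed names).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b, c):
    """Fisher information of a 3PL item at ability theta."""
    p = p_correct(theta, a, b, c)
    return a ** 2 * ((p - c) / (1.0 - c)) ** 2 * (1.0 - p) / p

def prune_exam(items, thetas, min_info):
    """Keep items whose mean information over the ability grid reaches min_info.

    This averaging rule is an assumption of the sketch, not the paper's rule.
    """
    kept = []
    for item in items:
        mean_info = sum(item_information(t, *item) for t in thetas) / len(thetas)
        if mean_info >= min_info:
            kept.append(item)
    return kept

# Made-up (a, b, c) parameters for three four-option questions.
items = [
    (1.5, 0.0, 0.25),  # discriminating, medium difficulty
    (0.1, 0.0, 0.25),  # barely discriminating
    (1.2, 6.0, 0.25),  # far too hard for the ability range considered
]
thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]
kept = prune_exam(items, thetas, min_info=0.05)
print(kept)
```

In this toy setting only the first item survives: the second barely separates strong from weak models, and the third is so hard that every model in the ability range is effectively guessing.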

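We propose a new method for measuring the task-specific accuracy of retrieval-augmented large language models (RAG). Evaluation is performed by scoring the RAG on an automatically generated synthetic exam of multiple-choice questions based on the document corpus associated with the task. Our method is an automated, cost-efficient, interpretable, and robust strategy for selecting the optimal components of a RAG system. We use Item Response Theory (IRT) to estimate the quality of an exam and its informativeness about task-specific accuracy. We demonstrate our approach on four new open-ended question-answering tasks based on arXiv abstracts, StackExchange questions, AWS DevOps troubleshooting guides, and SEC filings. In addition, our experiments reveal more general factors that affect RAG performance, such as model size, retrieval mechanism, prompting, and fine-tuning. Most notably, our findings show that choosing the right retrieval algorithm often yields bigger performance gains than simply using a larger language model.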

Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation

This research article highlights the potential of AI-powered chatbots in
education and presents the results of using ChatGPT, a large language model, to
complete the Vietnamese National High School Graduation Examination (VNHSGE).
The study dataset included 30 essays in the literature test case and 1,700
multiple-choice questions designed for other subjects. The results showed that
ChatGPT was able to pass the examination with an average score of 6-7,
demonstrating the technology's potential to revolutionize the educational
landscape. The analysis of ChatGPT's performance revealed its proficiency in a
range of subjects, including mathematics, English, physics, chemistry, biology,
history, geography, civic education, and literature, which suggests its
potential to provide effective support for learners. However, further research
is needed to assess ChatGPT's performance on more complex exam questions and its
potential to support learners in different contexts. As technology continues to
evolve and improve, we can expect to see the use of AI tools like ChatGPT
become increasingly common in educational settings, ultimately enhancing the
educational experience for both students and educators.

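This study explores the potential of chatbots in education by testing ChatGPT, a large language model, on the Vietnamese National High School Graduation Examination. The model performed well across many subjects, including literature, mathematics, English, physics, chemistry, biology, history, geography, and civic education, which indicates broad prospects for AI tools in education.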

Can ChatGPT pass the Vietnamese National High School Graduation Examination?

We present the first study to investigate Large Language Models (LLMs) in
answering radiation oncology physics questions. Because popular exams like AP
Physics, LSAT, and GRE have large test-taker populations and ample test
preparation resources in circulation, they may not allow for accurately
assessing the true potential of LLMs. This paper proposes evaluating LLMs on a
highly specialized topic, radiation oncology physics, which may be more
pertinent to scientific and medical communities in addition to being a valuable
benchmark of LLMs. We developed an exam consisting of 100 radiation oncology
physics questions based on our expertise at Mayo Clinic. Four LLMs, ChatGPT
(GPT-3.5), ChatGPT (GPT-4), Bard (LaMDA), and BLOOMZ, were evaluated against
medical physicists and non-experts. ChatGPT (GPT-4) outperformed all other LLMs
as well as medical physicists, on average. The performance of ChatGPT (GPT-4)
was further improved when prompted to explain first, then answer. ChatGPT
(GPT-3.5 and GPT-4) showed a high level of consistency in its answer choices
across a number of trials, whether correct or incorrect, a characteristic that
was not observed in the human test groups. In evaluating ChatGPT (GPT-4)'s
deductive reasoning ability using a novel approach (substituting the correct
answer with "None of the above choices is the correct answer."), ChatGPT
(GPT-4) demonstrated surprising accuracy, suggesting the potential presence of
an emergent ability. Finally, although ChatGPT (GPT-4) performed well overall,
its intrinsic properties did not allow for further improvement when scoring
based on a majority vote across trials. In contrast, a team of medical
physicists was able to greatly outperform ChatGPT (GPT-4) using a majority
vote. This study suggests a great potential for LLMs to work alongside
radiation oncology experts as highly knowledgeable assistants.
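
The contrast between a highly consistent responder and a team scored by majority vote can be sketched as follows. This is only an illustration of the scoring mechanics: the answer key and both sets of trials are made-up toy data, not results from the study, and the function names are hypothetical.

```python
from collections import Counter

def accuracy(answers, key):
    """Fraction of questions answered correctly."""
    return sum(a == k for a, k in zip(answers, key)) / len(key)

def mean_trial_accuracy(trials, key):
    """Average accuracy over individual trials (or individual team members)."""
    return sum(accuracy(t, key) for t in trials) / len(trials)

def majority_vote_accuracy(trials, key):
    """Accuracy after taking the most common answer per question across trials."""
    voted = [Counter(per_question).most_common(1)[0][0]
             for per_question in zip(*trials)]
    return accuracy(voted, key)

key = ["A", "B", "C", "D"]  # toy answer key

# A responder that repeats the same answers, right or wrong, in every trial.
consistent = [["A", "B", "D", "A"]] * 3

# Three responders whose mistakes fall on different questions.
diverse = [
    ["A", "B", "C", "A"],
    ["A", "C", "C", "D"],
    ["B", "B", "C", "D"],
]

print(mean_trial_accuracy(consistent, key), majority_vote_accuracy(consistent, key))
print(mean_trial_accuracy(diverse, key), majority_vote_accuracy(diverse, key))
```

When every trial repeats the same errors, voting has nothing to correct, so the majority-vote score equals the single-trial score. When errors are spread across different questions, the vote can recover answers that any one responder missed.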

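This study examines the ability of LLMs to answer radiation oncology physics questions. We developed an exam of 100 radiation oncology physics questions and evaluated four LLMs (ChatGPT (GPT-3.5), ChatGPT (GPT-4), Bard (LaMDA), and BLOOMZ) against medical physicists and non-experts. ChatGPT (GPT-4) outperformed all other LLMs as well as medical physicists, on average, and performed better still when prompted to explain first, then answer. ChatGPT (GPT-4) also showed surprising accuracy, suggesting a possible emergent reasoning ability, but its intrinsic properties did not allow its score to improve further under majority voting.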