Large language models (LLMs) have shown remarkable performance on many tasks across different domains. However, their performance in closed-book biomedical machine reading comprehension (MRC) has not been evaluated in depth. In this work, we evaluate GPT on four closed-book biomedical MRC benchmarks. We experiment with several conventional prompting techniques and introduce a novel prompting method of our own. To address some of the retrieval problems inherent to LLMs, we propose a prompting strategy named Implicit Retrieval Augmented Generation (RAG), which alleviates the need to use vector databases to retrieve important chunks, as is done in traditional RAG setups. Moreover, we report qualitative assessments of the natural language generation outputs from our approach. The results show that our new prompting technique achieves the best performance on two of the four datasets and ranks second on the other two. Experiments show that modern-day LLMs like GPT, even in a zero-shot setting, can outperform supervised models, leading to new state-of-the-art (SoTA) results on two of the benchmarks.
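The abstract does not spell out the prompt itself, so the following is only a minimal sketch of what a single-prompt, vector-database-free retrieval step might look like. It assumes each benchmark item supplies a context passage and a question, and that the model is asked to select the relevant excerpts itself before answering. The function name `build_implicit_rag_prompt`, the parameter `max_excerpts`, and the prompt wording are all hypothetical and are not taken from the paper.

```python
def build_implicit_rag_prompt(context: str, question: str, max_excerpts: int = 3) -> str:
    """Build one zero-shot prompt that folds retrieval into generation.

    Assumption (not from the paper): instead of querying a vector database
    for relevant chunks, the model is instructed to quote the relevant
    excerpts from the passage itself and then answer from those excerpts.
    """
    return (
        "You are given a biomedical passage and a question.\n"
        f"Step 1: Quote up to {max_excerpts} excerpts from the passage "
        "that are most relevant to the question.\n"
        "Step 2: Answer the question using only the quoted excerpts.\n\n"
        f"Passage:\n{context}\n\n"
        f"Question: {question}\n"
    )


# Hypothetical usage with a toy passage; the resulting string would be sent
# to whichever LLM client is in use.
example_prompt = build_implicit_rag_prompt(
    context="Metformin is commonly used as a first-line therapy for type 2 diabetes.",
    question="Which drug is described as a first-line therapy for type 2 diabetes?",
)
```

Keeping the excerpt selection and the answer in one prompt is what removes the separate embedding and indexing stage in this sketch; whether the authors' prompt follows this exact structure is an assumption.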

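We evaluate GPT on four closed-book biomedical machine reading comprehension benchmarks and propose a prompting strategy named Implicit Retrieval Augmented Generation (IRAG). The strategy addresses retrieval problems inherent to LLMs by reducing the need to use a vector database to retrieve important chunks, as traditional RAG setups do, and we present qualitative assessments of the method's natural language generation outputs. The results show that our new prompting technique achieves the best performance on two of the four datasets and ranks second on the other two. The experiments also show that modern LLMs such as GPT, even in a zero-shot setting, can outperform supervised models, yielding state-of-the-art results on two of the benchmarks.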

Can GPT Redefine Medical Understanding? Evaluating GPT on Biomedical Machine Reading Comprehension