Large language models (LLMs) have demonstrated impressive performance in answering
medical questions, including passing medical licensing examinations. However, most existing
benchmarks rely on board exam questions or general medical questions, falling
short in capturing the complexity of realistic clinical cases. Moreover, the
lack of reference explanations for answers hampers the evaluation of model
explanations, which are crucial to supporting doctors in making complex medical
decisions. To address these challenges, we construct two new datasets: JAMA
Clinical Challenge and Medbullets. JAMA Clinical Challenge consists of
questions based on challenging clinical cases, while Medbullets comprises USMLE
Step 2 and Step 3 style clinical questions. Both datasets are structured as
multiple-choice question-answering tasks, where each question is accompanied by
an expert-written explanation. We evaluate four LLMs on the two datasets using
various prompts. Experiments demonstrate that our datasets are harder than
previous benchmarks. The inconsistency between automatic and human evaluations
of model-generated explanations highlights the need to develop new metrics to
support future research on explainable medical QA.

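By constructing two new datasets and running experiments with multiple evaluation metrics and explanations written by medical experts, we find that LLMs perform well at answering medical questions, but that existing benchmark datasets fall short both in capturing the complexity of real clinical cases and in providing reference explanations. New metrics are therefore needed to support future research on explainable medical question answering.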