Curated datasets for healthcare are often limited because they require annotation by human experts. In this paper, we present MedEval, a multi-level, multi-task, and multi-domain medical benchmark to facilitate the development of language models for healthcare. MedEval is comprehensive: it consists of data from several healthcare systems and spans 35 human body regions across 8 examination modalities. With 22,779 collected sentences and 21,228 reports, we provide expert annotations at multiple levels, which allows fine-grained use of the data and supports a wide range of tasks. Moreover, we systematically evaluate 10 generic and domain-specific language models under zero-shot and fine-tuning settings, from domain-adapted baselines in healthcare to general-purpose state-of-the-art large language models (e.g., ChatGPT). Our evaluations reveal that the effectiveness of the two categories of language models varies across tasks, and they point to the importance of instruction tuning for few-shot use of large language models. Our investigation paves the way toward benchmarking language models for healthcare and provides insights into the strengths and limitations of adopting large language models in medical domains, informing their practical applications and future advancements.

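To facilitate the development of language models for healthcare, this paper introduces MedEval, a multi-level, multi-task, and multi-domain medical benchmark that contains data from several healthcare systems and spans 35 human body regions across 8 examination modalities. We systematically evaluate 10 generic and domain-specific language models and find that their effectiveness varies across tasks. We also highlight the importance of instruction tuning for few-shot use of large language models. The results provide a reference for benchmarking language models in healthcare and examine the strengths and limitations of adopting large language models in medical domains, informing their practical applications and future development.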

MedEval: A Multi-Level, Multi-Task, Multi-Domain Evaluation Benchmark for Medical Text Models