The integration of Artificial Intelligence (AI), especially Large Language Models (LLMs), into the clinical diagnosis process offers significant potential to improve the efficiency and accessibility of medical care. While LLMs have shown some promise in the medical domain, their application in clinical diagnosis remains underexplored, especially in real-world clinical practice, where highly sophisticated, patient-specific decisions need to be made. Current evaluations of LLMs in this field are often narrow in scope, focusing on specific diseases or specialties and employing simplified diagnostic tasks. To bridge this gap, we introduce CliBench, a novel benchmark developed from the MIMIC-IV dataset, offering a comprehensive and realistic assessment of LLMs' capabilities in clinical diagnosis. This benchmark not only covers diagnoses from a diverse range of medical cases across various specialties but also incorporates tasks of clinical significance: treatment procedure identification, lab test ordering, and medication prescription. Supported by structured output ontologies, CliBench enables a precise, multi-granular evaluation, offering an in-depth understanding of LLMs' capabilities on diverse clinical tasks at the desired level of granularity. We conduct a zero-shot evaluation of leading LLMs to assess their proficiency in clinical decision-making. Our preliminary results shed light on the potential and limitations of current LLMs in clinical settings, providing valuable insights for future advancements in LLM-powered healthcare.
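To make the idea of a multi-granular evaluation over a structured ontology concrete, the following is a minimal sketch and not the benchmark's actual scoring code. It assumes, purely for illustration, that predictions and references are hierarchical codes whose shorter prefixes denote coarser categories (as in ICD-10-CM-style codes), and that set-level F1 is the metric; the function names `truncate` and `f1_at_level` and the example codes are hypothetical.

```python
from typing import Iterable


def truncate(code: str, level: int) -> str:
    """Map a code to a coarser ancestor by keeping its first `level` characters.

    Assumption: a shorter prefix corresponds to a coarser ontology node.
    """
    return code.replace(".", "")[:level]


def f1_at_level(predicted: Iterable[str], gold: Iterable[str], level: int) -> float:
    """Set-level F1 between predicted and gold codes at one granularity level."""
    pred = {truncate(c, level) for c in predicted}
    ref = {truncate(c, level) for c in gold}
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


# Illustrative codes only; not drawn from the benchmark.
gold = ["I50.22", "E11.9"]
predicted = ["I50.9", "E11.9", "J18.9"]

# Score the same prediction at a coarse (3-character) and finer (4-, 5-character) level.
scores = {level: f1_at_level(predicted, gold, level) for level in (3, 4, 5)}
for level, score in scores.items():
    print(f"prefix length {level}: F1 = {score:.2f}")
```

In this sketch the prediction scores higher at the coarse level, where it names the right disease category, than at the finer levels, where the specific code differs. That gap is the kind of signal a multi-granular evaluation is meant to expose.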

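Integrating artificial intelligence (AI), particularly large language models (LLMs), into the clinical diagnosis process can significantly improve the efficiency and accessibility of healthcare. This work introduces CliBench, a new benchmark built on the MIMIC-IV dataset, to evaluate LLMs' capabilities in clinical diagnosis. It covers diagnoses across a diverse range of clinical cases and includes clinically relevant tasks such as treatment procedure identification, lab test ordering, and medication prescription. Through structured output ontologies, CliBench provides an in-depth view of LLMs' capabilities on different clinical tasks, offering valuable insights for the future development of LLMs in healthcare.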

CliBench: A Multifaceted Evaluation of Large Language Models in Clinical Decisions on Diagnoses, Procedures, Lab Tests, and Prescriptions