We investigate whether general-domain large language models such as GPT-4 Turbo can perform risk stratification and predict post-operative outcome measures from a description of the procedure and a patient's clinical notes derived from the electronic health record. We examine predictive performance on 8 tasks: prediction of ASA Physical Status Classification, hospital admission, ICU admission, unplanned admission, hospital mortality, PACU Phase 1 duration, hospital duration, and ICU duration. Few-shot and chain-of-thought prompting improve predictive performance for several of the tasks. We achieve F1 scores of 0.50 for ASA Physical Status Classification, 0.81 for ICU admission, and 0.86 for hospital mortality. Performance on the duration prediction tasks was universally poor across all prompt strategies. Current-generation large language models can assist clinicians in perioperative risk stratification on classification tasks and produce high-quality natural language summaries and explanations.

Large Language Model Capabilities in Perioperative Risk Prediction and Prognostication