Large vision-language models (VLMs) offer a new paradigm for AI-driven image understanding, enabling models to perform tasks without task-specific training. This flexibility holds particular promise across medicine, where expert-annotated data is scarce. Yet VLMs' practical utility in intervention-focused domains--especially surgery, where decision-making is subjective and clinical scenarios are variable--remains uncertain. Here, we present a comprehensive analysis of 11 state-of-the-art VLMs across 17 key visual understanding tasks in surgical AI--from anatomy recognition to skill assessment--using 13 datasets spanning laparoscopic, robotic, and open procedures. In our experiments, VLMs demonstrate promising generalizability, at times outperforming supervised models when deployed outside their training setting. In-context learning, in which examples are incorporated at test time, boosted performance by up to three-fold, suggesting adaptability as a key strength. Still, tasks requiring spatial or temporal reasoning remained difficult. Beyond surgery, our findings offer insights into VLMs' potential for tackling complex and dynamic scenarios in clinical and broader real-world applications.

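This study provides an in-depth analysis of the practical application of large vision-language models (VLMs) to image understanding tasks in surgery, addressing a gap in the literature on their utility. It finds that VLMs generalize well and that in-context learning in particular yields substantial performance gains, suggesting that adaptability is a key strength. However, performance on tasks requiring spatial or temporal reasoning remains weak. These findings offer important insights for future applications in clinical and other real-world settings.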

A Systematic Evaluation of Large Vision-Language Models in Surgical Artificial Intelligence