Conversational agents powered by large language models are revolutionizing the way we interact with visual data. Recently, large vision-language models (LVLMs) have been extensively studied for both images and videos. However, these studies typically focus on common scenarios. In this work, we introduce an LVLM specifically designed for surgical scenarios. We integrate visual representations of surgical images and videos into the language feature space. Consequently, we establish an LVLM, Surgical-LLaVA, fine-tuned on instruction-following data from surgical scenarios. Our experiments demonstrate that Surgical-LLaVA exhibits impressive multi-modal chat abilities in surgical contexts, occasionally displaying multi-modal behaviors on unseen instructions. We also conduct a quantitative evaluation on visual question-answering datasets for surgical scenarios. The results show superior performance compared to previous works, indicating the potential of our model to tackle more complex surgical scenarios.

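This study addresses a limitation of current vision-language models, which pay little attention to surgical scenarios, by proposing Surgical-LLaVA, a large vision-language model designed specifically for surgical scenarios. By integrating visual representations of surgical images and videos into the language feature space, the model demonstrates impressive multi-modal chat abilities in surgical contexts and shows superior performance in complex surgical scenarios.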

Surgical-LLaVA for Surgical Scenario Understanding: Leveraging Large Language and Vision Models