Visual Question Answering (VQA) within the surgical domain, utilizing Large
Language Models (LLMs), offers a distinct opportunity to improve
intra-operative decision-making and facilitate intuitive surgeon-AI
interaction. However, the development of LLMs for surgical VQA is hindered by
the scarcity of diverse and extensive datasets with complex reasoning tasks.
Moreover, contextual fusion of the image and text modalities remains an open
research challenge due to the inherent differences between these two types of
information and the complexity involved in aligning them. This paper introduces
PitVQA, a novel dataset specifically designed for VQA in endonasal pituitary
surgery, and PitVQA-Net, an adaptation of GPT2 with a novel image-grounded
text embedding for surgical VQA. PitVQA comprises 25 procedural videos and a
rich collection of question-answer pairs spanning crucial surgical aspects such
as phase and step recognition, context understanding, tool detection and
localization, and tool-tissue interactions. PitVQA-Net consists of a novel
image-grounded text embedding that projects image and text features into a
shared embedding space, and a GPT2 backbone with an excitation block
classification head to generate contextually relevant answers within the
complex domain of endonasal pituitary surgery. Our image-grounded text
embedding leverages joint embedding, cross-attention and contextual
representation to understand the contextual relationship between questions and
surgical images. We demonstrate the effectiveness of PitVQA-Net on both the
PitVQA and the publicly available EndoVis18-VQA dataset, achieving improvements
in balanced accuracy of 8% and 9% over the most recent baselines, respectively.
Our code and dataset are available at this https URL
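
As a rough illustration of the cross-attention step named in the abstract, the following minimal sketch lets each question-token vector attend over image-patch vectors and adds the attended image context back to the token. It uses plain Python with toy dimensions. The function names (`cross_attend`, `softmax`, `dot`), the residual addition, and the omission of learned projection matrices are assumptions made for illustration; this is not the PitVQA-Net implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cross_attend(text_tokens, image_patches):
    """Scaled dot-product cross-attention: text queries, image keys/values.

    Hypothetical simplification: no learned query/key/value projections,
    and the attended image context is added to each token as a residual.
    """
    dim = len(text_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    grounded = []
    for q in text_tokens:
        # Attention weights of this text token over all image patches.
        weights = softmax([dot(q, k) * scale for k in image_patches])
        # Weighted sum of patch vectors (the image context for this token).
        context = [
            sum(w * v[i] for w, v in zip(weights, image_patches))
            for i in range(dim)
        ]
        grounded.append([qi + ci for qi, ci in zip(q, context)])
    return grounded

# Toy, made-up inputs: two 2-d text tokens and three 2-d image patches.
text = [[1.0, 0.0], [0.0, 1.0]]
patches = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
grounded = cross_attend(text, patches)
```

Each output vector has the same dimension as its text token, so a sequence of such image-grounded embeddings could, in principle, be passed to a language-model backbone in place of plain text embeddings.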

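This paper presents PitVQA and PitVQA-Net, which use joint embedding and contextual representation of image and text information to address the challenge of complex question answering in endonasal pituitary surgery, and achieve significant performance improvements on the PitVQA and EndoVis18-VQA datasets.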
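
Balanced accuracy, the metric quoted for both benchmarks, is conventionally the unweighted mean of per-class recall, so frequent answer classes cannot dominate the score. The sketch below follows that standard definition; the function name and the toy labels are illustrative and are not taken from the paper's code or data.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Toy, made-up labels: an imbalanced two-class answer set.
y_true = ["suction"] * 8 + ["drill"] * 2
y_pred = ["suction"] * 8 + ["suction", "drill"]
score = balanced_accuracy(y_true, y_pred)
# Plain accuracy here is 9/10 = 0.9, while balanced accuracy is (1.0 + 0.5) / 2 = 0.75.
```

The gap between 0.9 and 0.75 in this toy case shows why balanced accuracy is the more informative number when answer classes are imbalanced.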

PitVQA: Image-grounded Text Embedding LLM for Visual Question Answering in Pituitary Surgery

The modern operating room is becoming increasingly complex, requiring innovative
intra-operative support systems. While the focus of surgical data science has
largely been on video analysis, integrating surgical computer vision with
language capabilities is emerging as a necessity. Our work aims to advance
Visual Question Answering (VQA) in the surgical context with scene graph
knowledge, addressing two main challenges in current surgical VQA systems:
removing question-condition bias in the surgical VQA dataset and incorporating
scene-aware reasoning in the surgical VQA model design. First, we propose a
Surgical Scene Graph-based dataset, SSG-QA, generated by employing segmentation
and detection models on publicly available datasets. We build surgical scene
graphs using spatial and action information of instruments and anatomies. These
graphs are fed into a question engine, generating diverse QA pairs. SSG-QA is
more complex, diverse, geometrically grounded, unbiased, and surgical
action-oriented than existing surgical VQA datasets. We
then propose SSG-QA-Net, a novel surgical VQA model incorporating a lightweight
Scene-embedded Interaction Module (SIM), which integrates geometric scene
knowledge in the VQA model design by employing cross-attention between the
textual and the scene features. Our comprehensive analysis of the SSG-QA
dataset shows that SSG-QA-Net outperforms existing methods across different
question types and complexities. We highlight that the primary limitation of
current surgical VQA systems is the lack of scene knowledge to answer
complex queries. We present a novel surgical VQA dataset and model and show
that results can be significantly improved by incorporating geometric scene
features in the VQA model design. The source code and the dataset will be made
publicly available at: this https URL
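
To make the scene-graph-to-question pipeline more concrete, here is a toy sketch of a template-based question engine over (subject, relation, object) triples. The graph contents, the two question templates, and the function name `generate_qa` are hypothetical and far simpler than the SSG-QA generation process, which the abstract does not describe in detail.

```python
def generate_qa(triples):
    """Generate simple QA pairs from (subject, relation, object) triples."""
    qa_pairs = []
    for subj, rel, obj in triples:
        # Query the object of an action or spatial relation.
        qa_pairs.append((f"What is the {subj} {rel}?", obj))
    # Counting question over the distinct subjects of the toy graph.
    subjects = sorted({subj for subj, _, _ in triples})
    qa_pairs.append(("How many instruments are in the scene?", str(len(subjects))))
    return qa_pairs

# Toy, made-up scene graph with action and spatial relations.
scene_graph = [
    ("grasper", "retracting", "gallbladder"),
    ("hook", "dissecting", "cystic duct"),
    ("grasper", "left of", "hook"),
]
qa = generate_qa(scene_graph)
for question, answer in qa:
    print(question, "->", answer)
```

Because every answer is read directly from the graph, each generated pair stays grounded in the spatial and action relations of the scene rather than in free-form annotation.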

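Using scene graph knowledge to address two challenges in current surgical VQA systems, namely question-condition bias and the incorporation of scene-aware reasoning, we propose SSG-QA, a surgical scene graph-based dataset, and SSG-QA-Net, a novel surgical VQA model, and show that incorporating geometric scene features into the VQA model design significantly improves results.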